1 Introduction

Machine learning is a field of artificial intelligence research in which mathematical and statistical models, applied through computational algorithms, seek to provide machines with intelligent behavior and the ability to learn from experience. In particular, Reinforcement Learning (RL) research is dedicated to developing intelligent computational entities called agents, which try to learn an action policy to interact with an environment and perform a given task: each agent action on an environment state results in a new state and a reward signal, and the objective is to learn an optimal policy that maximizes the total expected reward in the long run. If the agent learns an optimal action policy, it will have learned to perform the task. Based on the formal framework of the Markov Decision Process (MDP), different RL methods have been proposed, from the early formulations in the 1950s onward, to deal with problems involving sequences of decisions. These methods define value functions to evaluate the agent's choices, either using dynamic programming in tabular environments to find exact solutions for discrete action values or approximating these value functions in high-dimensional state spaces with discrete or continuous action values.

Temporal Difference Learning (TD-learning) was an essential formulation for the development of trial-and-error-based learning methods. It allows the agent to update its estimates toward a target value formed from the observed reward and the estimated value of the state resulting from the previous interaction, making it possible for many later algorithms to converge to an optimal policy even while acting sub-optimally, as long as they keep updating their value functions. Moreover, it was the basis for a whole class of off-policy and model-free methods. To cope with some efficiency issues inherent to TD-learning and make TD-based approaches more data efficient, Lin (1992) proposed a fundamental technique called Experience Replay (ER), which consists of reusing previous agent experiences (i.e., previous state transitions) to update the value function by storing them in a replay buffer and sampling them uniformly from it. This strategy brought unique benefits when artificial neural networks are used to approximate value functions because it decorrelates the agent's training data.

A wide range of modern methods employ ER. At the same time, many authors seek to improve it by investigating how to make better use of the replay buffer, how to sample better experiences for agent learning, how to deal with a size-limited buffer (and how big it should be), how to model that buffer, and many other questions. Due to its benefits for data-efficient reinforcement learning, research on the use and improvement of ER has grown sharply since 2016, with a volume of publications far exceeding that of previous years. Despite its relevance and the growing number of publications and methods that use it in some way, we have not found a paper in the last two years dedicated to a literature review specifically about the evolution and application of ER techniques. Zhu et al. (2024) present a survey about multi-agent deep reinforcement learning with communication (Comm-MADRL) focusing on agents’ communication processes. Hickling et al. (2023) present a review of methods and applications for explainability in deep reinforcement learning. Shen and Zhao (2024) review the task construction settings and the application of RL to various natural language processing problems. Like these, other review works present evaluations or applications of classes of RL methods (e.g., Peng et al. (2024), Elharrouss et al. (2024), Mishra and Arora (2024)) but do not have ER as the central focus of an extensive review. Mckenzie and Mcdonnell (2022) present a review focused on the progression of value-based Reinforcement Learning in the five years preceding its publication. They highlight diverse algorithmic changes, including ER, which figures among the many factors, techniques, and strategies discussed in the review; in particular, the authors emphasize advances in recurrent experience replay for distributed reinforcement learning.

Therefore, this work reviews reinforcement learning methods, from their early foundations to current approaches, to formally understand and compare how they use ER and how it makes them more data efficient. Moreover, we seek to contribute to the understanding of its fundamental ideas and to highlight the many theoretical and empirical open problems still under investigation, organizing and pointing out possible future works and research directions. We are therefore primarily interested in works that propose changes or new methods using ER and especially interested in those that investigate ER itself, delving into its theoretical issues and empirical investigations. One of the main contributions of this work is a taxonomy that organizes the many research studies and their different methods. It focuses on how they improve and apply experience replay strategies, highlighting their specificities and contributions, with ER as the central topic. Another relevant contribution is how we organize knowledge in a facet-oriented way, allowing different reading perspectives, whether based on the fundamental problems of RL, focusing on algorithmic strategies and architectural decisions, or oriented toward different applications.

This work is organized as follows. In Sect. 2, we present and discuss the theoretical background and some related work. Section 3 presents the Experience Replay foundations. Section 4 presents relevant deep reinforcement learning methods, focusing on how they use ER and how their propositions differ in exploiting agents’ experiences. Section 5 discusses some of the main research challenges and trends. Section 6 discusses recent research in Experience Replay and some directions for future work. Section 7 presents a structured summary of the research works, methods, and challenges discussed in this extensive review. Finally, we draw some conclusions in Sect. 8.

2 Background

Reinforcement Learning uses the formal framework of the Markov Decision Process (MDP) to define the interaction between a learning agent and its environment in terms of states, actions, and rewards. An MDP is defined by a tuple \((\mathcal {S}, \mathcal {A}, \mathcal {P}, \mathcal {R}, \gamma )\), such that: \(\mathcal {S}\) is a set of states; \(\mathcal {A}=\{a_1,a_2,\ldots ,a_n\}\) is a set of actions; \(\mathcal{{P}}(s'\,\vert \,s, a)\) is the probability of transitioning from state s to \(s'\) (\(s, s'\in \mathcal {S}\)) by taking action \(a\in \mathcal {A}\); \(\mathcal {R}\) is a reward function mapping each state-action pair to a reward in \(\mathbb {R}\); and \(\gamma \in [0, 1]\) is a discount factor. A policy \(\pi\) represents the agent’s behavior, and the value \(\pi (a\,\vert \, s)\) represents the probability of taking action a in state s. At each time step t, the agent observes a state \(s_t \in \mathcal {S}\) and chooses an action \(a_t \in \mathcal {A}\) that determines the reward \(r_t = \mathcal{{R}}(s_t, a_t)\) and the next state \(s_{t+1} \sim \mathcal{{P}}(\cdot \,\vert \,s_t, a_t)\), producing a state transition \(T(s,a,s')\). The discounted sum of future rewards is called the return, \(R_t = \sum _{t'=t}^{\infty } \gamma ^{t'-t} r_{t'}\). The agent aims to learn (or approximate) an optimal policy \(\pi ^*\) that maximizes the expected long-term (discounted) reward. These processes imply nondeterministic search problems and stochastic decision sequences, in which actions are selected from observations of each environment state resulting from a previous decision. In this way, each agent action determines the immediate reward and, more importantly, influences subsequent environment states and future rewards. While the immediate reward informs about the result of an action performed in the current state, the long-term expected reward allows the action policy to be evaluated through a value function.
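
To make this interaction protocol concrete, the following minimal sketch (assuming a small hypothetical tabular MDP with random transition and reward tables) runs the loop described above under a uniform-random policy and accumulates the discounted return; all names and sizes are illustrative.

```python
# A minimal sketch of the agent-environment loop on a hypothetical tabular MDP.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over s'
R = rng.normal(size=(n_states, n_actions))                        # R[s, a] is the immediate reward

def policy(s):
    return rng.integers(n_actions)            # a stand-in uniform-random policy pi(a|s)

s, ret, discount = 0, 0.0, 1.0
for t in range(100):                          # one (truncated) episode
    a = policy(s)
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])  # s' ~ P(.|s, a)
    ret += discount * r                       # accumulate the discounted return R_t
    discount *= gamma
    s = s_next
print("sampled discounted return:", ret)
```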

Bellman formulated MDPs as a stochastic version of the optimal control problem and described two value functions using the concept of states of a dynamical system. The state-value function \(v_{\pi }(s)\) estimates the expected total (discounted) reward when the agent starts in state s and follows the policy \(\pi\). Expanding \(v_\pi (s)\), one can note (see Eqs. 1–4) the expected sum of future rewards for the states reached by adopting the policy \(\pi\) and performing the sequences of state transitions. Equation (5) makes the recursive nature of the computation of \(v_\pi (s)\) explicit, and Eq. (6) introduces the discount factor \(\gamma\) on the total expected reward.

$$\begin{aligned} v_{\pi }(s)&= \mathbb {E}_{\pi }[r_{1}+r_{2}+...+r_{T} \,\vert \, s_{t} = s]&\end{aligned}$$
(1)
$$\begin{aligned}&= \mathbb {E}_{\pi }[r_{t}] + \mathbb {E}_{\pi }[r_{t+1}+r_{t+2}+...+r_{T} \,\vert \, s_{t} = s]&\end{aligned}$$
(2)
$$\begin{aligned}&= \sum _{a}\pi (s,a)R(s,a) + \mathbb {E}_{\pi }[r_{t+1} + r_{t+2} +\dots + r_{T} \,\vert \, s_{t} = s]&\end{aligned}$$
(3)
$$\begin{aligned}&= \sum _{a}\pi (s,a)R(s,a) + \sum _{a}\pi (s,a)\sum _{s'}T(s,a,s')\mathbb {E}_{\pi }[r_{t+1}+\dots +r_{T} \,\vert \, s_{t} = s']&\end{aligned}$$
(4)
$$\begin{aligned}&= \sum _{a}\pi (s,a)R(s,a) + \sum _{a}\pi (s,a)\sum _{s'}T(s,a,s')v_{\pi }(s')&\end{aligned}$$
(5)
$$\begin{aligned}&= \sum _{a}\pi (s,a)\left[ R(s,a) +\gamma \sum _{s'}T(s,a,s')v_{\pi }(s')\right] \end{aligned}$$
(6)

In the form presented by Sutton and Barto (2018) in Eq. 7, the state-value function \(v_\pi (s)\) makes the transition probabilities explicit, showing the relationship between the value of a state and the values of its successor states. In turn, the action-value function \(q_{\pi }(s,a)\) estimates the total expected reward if the agent takes action a in state s and then follows the policy \(\pi\), allowing it to assess the utility of each possible action in that state (Eq. 8).

$$\begin{aligned} v_\pi (s)&= \sum _{a}\pi (a\,\vert \, s)\sum _{s',r}p (s',r\,\vert \, s,a)\left[ r+\gamma v_\pi (s')\right]&\end{aligned}$$
(7)
$$\begin{aligned} q_{\pi }(s,a)&= R(s,a)+\gamma \sum _{s'}T(s,a,s')\left[ \sum _{a'}\pi (s',a')q_{\pi }(s',a')\right]&\end{aligned}$$
(8)

In MDPs, a policy \(\pi\) is better than or equivalent to another policy \(\pi '\) if \(v_\pi (s)\ge v_{\pi '}(s), \forall s \in S.\) In all cases, at least one optimal policy \(\pi ^*\) is better than or equal to all others. These policies share the same optimal state-value function \(v^*(s) = max_\pi v_\pi (s)\), which is the highest value that can be obtained for each state, and the same optimal action-value function \(q^*(s, a) = max_\pi q_\pi (s, a)\), \(\forall s\in \mathcal {S}, a \in \mathcal {A}\) (Sutton and Barto 2018). It is possible to write the optimal action-value function in terms of the optimal state-value function, so that \(q^*(s, a) = \mathbb {E}[R_{t+1}+\gamma v^*(S_{t+1}) \,\vert \, S_{t}=s, A_{t} = a]\). Since \(v^*(s)\) is an optimal state-value function, Bellman’s equation shows that the value of a state under an optimal policy must equal the expected return for the best action in that state:

$$\begin{aligned} v^*(s)&= max_{a\in A(s)}q_{\pi *}(s,a)&\nonumber \\&= max_a\sum _{s',r}p(s',r \,\vert \, s,a)[r+\gamma v^*(s')] \end{aligned}$$
(9)

In turn, the Bellman optimality equation for the action-value function can be defined as follows:

$$\begin{aligned} q^*(s,a)&= \mathbb {E}[R_{t+1}+\gamma max_{a'}q^*(S_{t+1},a') \,\vert \, S_{t} = s, A_{t}=a]&\nonumber \\&= \sum _{s',r} p(s',r \,\vert \, s,a)[r+ \gamma max_{a'}q^*(s',a')] \end{aligned}$$
(10)

From \(v^*\), one can find \(\pi ^*\) and vice versa, both of which are solutions for MDPs. For each state, one or more actions will produce the maximum value in Bellman’s equation, and any policy that maximizes \(v^*\) will be optimal. While knowledge of the optimal state-value function \(v^*\) makes it possible to search for the optimal policy \(\pi ^*\), knowing the optimal action-value function \(q^*\) makes it easy to choose optimal actions. For any state s, the agent only needs to find an action that maximizes \(q^*\), because \(q^*\) effectively caches the results of all one-step-ahead searches: it gives the optimal expected return as a local, immediately available value for each state-action pair. This allows optimal actions to be selected without knowing the possible successor states and their values or, in other words, without knowing the dynamics of the environment.

Bellman’s equations allow optimal policies to be found. Still, they are rarely used directly in practice, as they demand exhaustive searches over the spaces of states and actions and assume that the dynamics of the environment are precisely known, which is not always true. These requirements limit the class of methods known as Dynamic Programming (DP) (Szepesvári 2010), which can converge to optimal policies with exact solutions and provide the basis for understanding several other reinforcement learning methods, since many of them consist of attempts to achieve the same results at a lower computational cost and without the need for a perfect model of the environment (Sutton and Barto 2018). The main idea of DP is to use value functions to search for optimal policies, assuming that the environment is described as an MDP, that the sets of states, actions, and rewards are finite, and that there is a probability function describing the environment’s dynamics. In this way, DP can compute the value functions by transforming the Bellman equations into update rules. In this sense, there are four main related algorithms: policy evaluation, policy improvement, policy iteration, and value iteration.

The policy evaluation method is an iterative solution that uses the state-value function. For a sequence of functions \(\{v_{0},\ldots , v_{k}\}\) mapping states to values, \(v_{0}\) is chosen arbitrarily and updated from the values computed in subsequent iterations using the Bellman equation for \(v_{\pi }\) as an update rule, in which the value \(v_{k+1}(s)\) at iteration \(k+1\) considers the expected discounted return obtained for the next possible state \(s'\) in the previous iteration, for every state \(s \in S\). It is possible to demonstrate that the sequence of value functions \(v_{k}\) converges to \(v_{\pi }\) when \(k\rightarrow \infty\). At each iteration, to produce \(v_{k+1}\) from \(v_{k}\), the algorithm applies the same operation to each state s, assigning a new value obtained from the previous values of its successor states \(s'\) and the expected immediate reward for each possible transition under the policy being evaluated. In this way, each iteration updates the value of each state to produce a new approximation of the state-value function \(v_{k+1}\):

$$\begin{aligned} v_{k+1}(s)&= \mathbb {E}_{\pi }[R_{t+1}+\gamma v_{k}(S_{t+1})\,\vert \, S_{t}=s]&\nonumber \\&= \sum _{a}\pi (a\,\vert \, s)\sum _{s',r}p(s',r \,\vert \, s,a)[r+\gamma v_{k}(s')] \end{aligned}$$
(11)
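
As a minimal illustration of this update rule, the sketch below performs iterative policy evaluation on a small hypothetical tabular MDP; the arrays P, R, and pi and the stopping tolerance are illustrative assumptions, not part of the original formulation.

```python
# Iterative policy evaluation (Eq. 11) on a hypothetical tabular MDP.
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99, tol=1e-8):
    """P[s, a, s'] transition probs, R[s, a] rewards, pi[s, a] action probs."""
    n_states = P.shape[0]
    v = np.zeros(n_states)                      # arbitrary v_0
    while True:
        # v_{k+1}(s) = sum_a pi(a|s) sum_{s'} P(s'|s,a) [R(s,a) + gamma v_k(s')]
        q = R + gamma * P @ v                   # q[s, a] under the current v_k
        v_next = (pi * q).sum(axis=1)
        if np.max(np.abs(v_next - v)) < tol:    # stop when a full sweep changes little
            return v_next
        v = v_next

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))      # hypothetical 3-state, 2-action MDP
R = rng.normal(size=(3, 2))
pi = np.full((3, 2), 0.5)                       # uniform random policy
print(policy_evaluation(P, R, pi))
```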

Even having determined the state-value function \(v_{\pi }\) for a policy \(\pi\), it is still possible to check whether it would be better to select an action a different from the one the policy prescribes in that state. One way to answer this question is to compute the result of the action-value function \(q_{\pi }(s, a)\), introducing a policy improvement step into the policy evaluation method:

$$\begin{aligned} q_{\pi }(s,a)&= \mathbb {E}[R_{t+1} + \gamma v_{\pi }(S_{t+1}) \,\vert \, S_{t}=s, A_{t}=a]&\nonumber \\&=\sum _{s',r}p(s',r \,\vert \, s,a)[r+\gamma v_{\pi }(s')] \end{aligned}$$
(12)

The policy will change if \(q_{\pi }(s,a) > v_{\pi }(s)\), as it will be better to choose the action a in the state s and then follow the policy \(\pi\) instead of following \(\pi\) all the time. So, it is expected that it will be better to select the action a every time the state s is found and that this new policy will be the best overall. Therefore, it is possible to consider changes in all states for all possible actions in a greedy strategy, selecting in each state the best action according to \(q_{\pi }(s, a)\), so that the new policy \(\pi '\) is given by:

$$\begin{aligned} \pi '(s)&= argmax_{a}q_{\pi }(s,a)&\nonumber \\&= argmax_{a}\mathbb {E}[R_{t+1}+\gamma v_{\pi }(S_{t+1}) \vert S_{t}=s, A_{t}=a]&\nonumber \\ &= argmax_{a}\sum _{s',r}p(s',r \,\vert \, s,a)[r+\gamma v_{\pi }(s')] \end{aligned}$$
(13)

This is a special case of the policy improvement theorem. Let \(\pi\) and \(\pi '\) be two policies such that, for all \(s\in S\), \(q_{\pi }(s,\pi '(s))\ge v_{\pi }(s)\). Then the policy \(\pi '\) must be as good as or better than \(\pi\) (i.e., \(\pi '\) must have an expected return greater than or equal to that of \(\pi\)), so that \(v_{\pi '}(s)\ge v_{\pi }(s).\) This result applies particularly to the original policy \(\pi\) and the modified policy \(\pi '\). If \(q_{\pi }(s, a)>v_{\pi }(s)\), then the modified policy will be better than the original policy. Given a policy and its value function, it is possible to evaluate a policy change in a single state for a given action (Sutton and Barto 2018). Once a policy \(\pi\) has been improved, a policy iteration process produces a sequence of improvements until it reaches an optimal policy \(\pi ^*\) and an optimal value function \(v^*\), as each new action policy is guaranteed to be better than the previous one unless the previous one is already optimal. Considering that a finite MDP has a finite number of policies, this process must converge to an optimal value function and policy in a finite number of iterations.

Although convergence to the optimal policy and the optimal value function is guaranteed, each policy iteration step includes the policy evaluation, which is also iterative, leading to a computation that requires many scans in the space of states. However, truncating the policy evaluation step is possible without losing the policy iteration convergence guarantee. A special case occurs when the policy evaluation stops after a single scan (i.e., after an update step for each state). This method is called value iteration and is a simple update process that combines policy improvement and short policy evaluation steps, as in Eq. 14, for all \(s\in S\):

$$\begin{aligned} v_{k+1}(s)&= max_{a}\mathbb {E}[R_{t+1} + \gamma v_{k}(S_{t+1})\vert S_{t}=s, A_{t}=a]&\nonumber \\&= max_{a}\sum _{s',r} p(s',r\,\vert \, s,a)[r+\gamma v_{k}(s')] \end{aligned}$$
(14)

It achieves faster convergence by interposing multiple policy evaluation scans between each policy improvement scan; meanwhile, its output consists of a deterministic policy \(\pi \approx \pi ^{*}\) such that \(\pi (s ) = argmax_{a}\sum _{s',r}p(s',r\,\vert \, s, a)[r+\gamma v(s')]\).
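
The following sketch illustrates value iteration as described above on the same kind of hypothetical tabular MDP: it sweeps the one-step lookahead of Eq. 14 until the values stabilize and then extracts a greedy (approximately optimal) deterministic policy. All quantities are illustrative.

```python
# Value iteration (Eq. 14) on a hypothetical tabular MDP.
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v              # one-step lookahead for every (s, a)
        v_next = q.max(axis=1)             # v_{k+1}(s) = max_a q(s, a)
        if np.max(np.abs(v_next - v)) < tol:
            v = v_next
            break
        v = v_next
    q = R + gamma * P @ v                  # greedy policy w.r.t. the converged values
    return v, q.argmax(axis=1)             # approximate v* and a deterministic pi ~ pi*

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = rng.normal(size=(3, 2))
v_star, pi_star = value_iteration(P, R)
print(v_star, pi_star)
```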

According to Sutton and Barto (2018), Temporal Difference Learning (TD) is one of the most relevant ideas in reinforcement learning. It allows learning to occur directly from the agent's experience, without the need for a model of the dynamics of the environment, and it can update estimates based on other learned estimates before reaching a final state, which is a clear advantage over DP methods concerning computational efficiency. There are variations of the TD method, denoted \(TD(\lambda )\), in which the parameter \(\lambda\) controls how much future steps contribute to the temporal-difference target, interpolating between one-step updates and full-return updates. TD(0) is the one-step case: it updates the estimate of \(v(s_{t})\) for an iteration t using the observed immediate reward r and the estimate of \(v(s_{t+1})\) at iteration \(t+1\). Thus, it waits only for the next iteration to form a target value \(r+\gamma v(s_{t+1})\) and updates the value of \(s_{t}\) immediately after the transition to \(s_{t+1}\), so that \(v(s_{t})\leftarrow v(s_{t})+\alpha [r+\gamma v(s_{t+1}) - v(s_{t})]\), where \(\alpha\) is a learning rate, \(\gamma\) is the discount factor, and \(s_{t}\) and \(s_{t+1}\) are, respectively, the environment’s states at iterations t and \(t+1\). The difference \(r+\gamma v(s_{t+1}) - v(s_{t})\) is known as the Temporal Difference Error (TDE). Sampling-based updates, like those used in TD methods, are distinct from those used in dynamic programming methods, as they are based on a single successor state and not on a complete probability distribution over all possible successors. Thus, TD methods are independent of a model of the environment, are naturally implemented online and incrementally, and converge under appropriate conditions.
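
As a minimal sketch of the TD(0) update just described, the code below evaluates a fixed (uniform-random) policy online in a hypothetical tabular environment, applying the update after every transition; the environment and hyperparameters are illustrative.

```python
# Tabular TD(0) prediction for a fixed policy on a hypothetical environment.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

v = np.zeros(n_states)
alpha, gamma = 0.1, 0.99
s = 0
for t in range(10_000):
    a = rng.integers(n_actions)                   # fixed (uniform random) policy
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    td_error = r + gamma * v[s_next] - v[s]       # the TD error
    v[s] += alpha * td_error                      # online, incremental update
    s = s_next
print(v)
```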

Q-Learning (Watkins and Dayan 1992) is a model-free and off-policy algorithm that applies successive steps to update the estimates of the action-value function Q(s, a) (which approximates the long-term expected discounted reward of executing an action from a given state) using TD-learning and minimizing the TD-error (defined by the difference in Eq. 15). This function is named the Q-function, and its estimated returns are known as Q-values. A higher Q-value indicates that an action a would yield better long-term results in state s. Q-Learning converges to an optimal policy \(\pi ^*\) even if it does not act optimally at every step, as long as it keeps updating the Q-value estimates for all state-action pairs and the learning rate \(\alpha\) satisfies the usual stochastic approximation conditions, with the update described in Eq. 15.

$$\begin{aligned} Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha [r +\gamma \max _{a}Q(s_{t+1},a)- Q(s_t,a_t)] \end{aligned}$$
(15)
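
A minimal tabular sketch of the Q-Learning update of Eq. 15 follows, using an \(\epsilon\)-greedy behavior policy; the environment, hyperparameters, and number of steps are illustrative.

```python
# Tabular Q-Learning (Eq. 15) with an epsilon-greedy behavior policy.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1
s = 0
for t in range(50_000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    # off-policy target: r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
print(Q)
```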

Applying Q-learning to those problems where the space of states and actions is too large to learn all the actions’ values in all possible states or when these states are multidimensional data, one can achieve a good approximate solution by learning a parameterized value function \(Q(s,a,\Theta _{t})\) as in Eq. 16 (Van Hasselt et al. 2016).

$$\begin{aligned} \Theta _{t+1}= \Theta _{t}+\alpha [r_{t+1}+\gamma \max _{a}Q(s_{t+1},a,\Theta _{t})-Q(s_{t},a_{t},\Theta _{t})]\nabla _{\Theta _{t}}Q(s_{t},a_{t},\Theta _{t}) \end{aligned}$$
(16)

One can see that the target in the TD-error calculation consists of a greedy policy defined by the max operator. In this way, the maximum over the estimated values is implicitly used as an estimate of the highest return value, which can lead to considerable maximization bias. For example, in a state s, there may be many actions for which the actual return value q(s, a) is zero, but the estimated values Q(s, a) may be distributed over negative and positive values. In this case, always taking the maximum introduces a clear positive bias in this set of actions. Such a maximization bias can lead the agent to choose misguided actions more often in a given state. One way to approach this problem is to use two independent estimates, \(Q_{1}(s, a)\) and \(Q_{2}(s, a)\), of the actual value of q(s, a), \(\forall a\in A\). So, we can use \(Q_{1}(s,a)\) to determine the action \(a^*= argmax_{a}Q_{1}(s,a)\) and \(Q_{2}(s,a)\) to provide the estimate \(Q_{2}(s,a^*)= Q_{2}(s,argmax_{a}Q_{1}(s, a))\). It is also possible to perform the same process reversing the roles of \(Q_{1}(s, a)\) and \(Q_{2}(s, a)\) to obtain a second reduced-bias estimate from \(Q_{1}(s,argmax_{a}Q_{2}(s, a))\). This is the approach proposed by van Hasselt (2010) to formulate the method called Double Q-Learning (Eq. 17), in which only one of the estimates is updated at each training step based on a probability value. The author demonstrated that this approach reduces bias by decomposing the \(\max\) operation in the target into action selection and action evaluation, which improves the update of the action-value function and makes it more stable by diminishing the overestimation of the Q-values.

$$\begin{aligned} Q_{1}(s,a) \leftarrow Q_{1}(s,a)+\alpha [r +\gamma Q_{2}(s_{t+1}, argmax_{a}Q_{1}(s_{t+1},a))-Q_{1}(s,a)] \end{aligned}$$
(17)
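
The sketch below illustrates tabular Double Q-Learning as in Eq. 17: two estimates are maintained, and at each step one of them (chosen at random) is updated, with the other providing the evaluation of the selected greedy action. The environment and hyperparameters are again illustrative.

```python
# Tabular Double Q-Learning (Eq. 17) on a hypothetical environment.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

Q1 = np.zeros((n_states, n_actions))
Q2 = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1
s = 0
for t in range(50_000):
    Q_sum = Q1[s] + Q2[s]                          # behave greedily w.r.t. both estimates
    a = rng.integers(n_actions) if rng.random() < eps else int(Q_sum.argmax())
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    if rng.random() < 0.5:                         # update only one estimate per step
        a_star = int(Q1[s_next].argmax())          # Q1 selects the action ...
        Q1[s, a] += alpha * (r + gamma * Q2[s_next, a_star] - Q1[s, a])  # ... Q2 evaluates it
    else:
        a_star = int(Q2[s_next].argmax())
        Q2[s, a] += alpha * (r + gamma * Q1[s_next, a_star] - Q2[s, a])
    s = s_next
print(Q1)
```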

To use parameterized value functions, Double Q-Learning learns two value functions using two different sets of weights \(\Theta\) and \(\Theta ^{\prime }\) and, at each update step, one set is used to select the action greedily and the other to evaluate it, as defined in Eqs. 18 and 19.

$$\begin{aligned} y_{t}= & r_{t}+\gamma Q(s_{t+1},argmax_{a}Q(s_{t+1},a,\Theta _{t}),\Theta ^{\prime }_{t}) \end{aligned}$$
(18)
$$\begin{aligned} \Theta _{t+1}= & \Theta _{t}+\alpha [y_{t}-Q(s_{t},a_{t},\Theta _{t})] \nabla _{\Theta _{t}}Q(s_{t},a_{t},\Theta _{t}) \end{aligned}$$
(19)

As an agent interacts with stochastic, non-deterministic, and partially observable environments, exploration and exploitation are also essential concepts, and how to balance them is a recurring challenge. Exploration refers to trying what is new and generating new knowledge, at the cost of greater risk, in order to expand the agent’s knowledge. Exploitation relates to the knowledge already assimilated: maximizing efficiency and performance, minimizing risk, and refining what has already been acquired. To increase the accumulated reward, an agent tends to select actions already experienced that produced good results, but it must also try new actions to discover those that may provide greater rewards, thus choosing between obtaining rewards quickly and having a chance of selecting better actions in the future. The agent must try various actions and progressively favor the best ones. This dilemma is a complex problem that has not yet been exhausted in the literature.

Bellemare et al. (2017) discuss a distributional perspective on the return value, in contrast to modeling its expectation, and propose an algorithm that applies the Bellman equation to learn approximate value distributions. Considering that the value function Q estimates the expectation of the random return (resulting from the probabilities of the transitions), the authors describe its distributional nature as in Eq. 20, where Z is the value distribution and R (the reward function) is explicitly a random variable. A stationary policy \(\pi\) maps each state to a probability distribution over the action space.

$$\begin{aligned} \small Z(s, a) {\mathop {=}\limits ^{D}} R(s, a)+\gamma Z\left( S^{\prime }, A^{\prime }\right) \end{aligned}$$
(20)

The authors define the value \(Z^{\pi }\) as the sum of discounted rewards along the agent’s interaction with the environment, the value functions as vectors in \(\mathbb {R}^{S\times A}\), and consider the expected reward function as one of those vectors. Therefore, they define a Bellman operator \(\tau ^{\pi }\) and an optimality operator \(\tau\) like in Eqs. 21 and 22, where P is the transition function (as defined in Sect. 2). Instead of expectation, they consider the full distribution of the variable \(Z^{\pi }\), which they call value distribution. Moreover, they also discuss the theoretical behavior of the distributional analogs of the Bellman operator in the control setting.

$$\begin{aligned} \small \tau ^{\pi }Q(s,a):= & \mathbb {E}R(s,a)+\gamma \mathbb {E}_{P,\pi }Q(s',a') \end{aligned}$$
(21)
$$\begin{aligned} \tau Q(s,a):= & \mathbb {E}R(s,a)+\gamma \mathbb {E}_P\max _{a'}Q(s',a') \end{aligned}$$
(22)

Finally, the authors present their state-of-the-art results of modeling and applying the distributional value within a DQN agent evaluated on ALE (Bellemare et al. 2013) and demonstrate considerable improvements in agent performance.

3 Experience replay

After an agent has performed a sequence of actions and received a return value, knowing how to assign credit (or blame) to each state-action pair is a difficult problem called Temporal Credit Assignment. Temporal Difference Learning (TD) represents one of the main techniques to deal with this problem, despite being a slow process, especially when it involves credit propagation over a long sequence of actions. For example, Adaptive Heuristic Critic – AHC (Sutton 1992) and Q-Learning (Watkins and Dayan 1992), which represent the first TD-learning-based methods in RL, are characterized by long convergence times. An effective technique called Experience Replay (ER) was proposed by Lin (1992) to speed up the credit assignment process and consequently reduce convergence time by storing agent experiences in a replay buffer and uniformly sampling past experiences to update the agent model. One of its main motivations is that such algorithms become inefficient when they use trial-and-error experiences only once to adjust the evaluation functions and then discard them, because some experiences can be rare while others can be expensive to obtain, such as those involving penalties. Moreover, using ER with random sampling reduces the effect of correlation in the data (which represents the environment states) and smooths its nonstationary distribution when neural networks are used to approximate the value functions, because it averages the distribution over many previous experiences (Mnih et al. 2013). The correlation between consecutive observations of the environment’s states can cause even minor updates to the approximate model to generate considerable changes in the policy learned by the agent (Mnih et al. 2015), which in turn changes the data distribution and the relation between the action-value estimates and the targets used to calculate the TD-error.
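
As a minimal illustration of the mechanism, the sketch below implements a size-limited replay buffer with uniform sampling; the capacity, the Transition fields, and the toy usage are illustrative assumptions rather than Lin's (1992) original implementation.

```python
# A size-limited replay buffer with uniform sampling.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are overwritten

    def store(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # uniform sampling decorrelates consecutive observations within a batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=1000)
for t in range(5):
    buffer.store(t, 0, 0.0, t + 1, False)      # toy transitions
batch = buffer.sample(3)
```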

Lin (1992) compared Experience Replay with two other techniques for shortening the trial-and-error process: (i) Learning Action Models for Planning and (ii) Teaching. In Learning Action Models for Planning, the agent relies on a model of its actions, as in cases where a perfect model of the environment is available or a good one can be quickly learned. In Teaching, lessons from a teacher (i.e., a human player or perhaps another agent) demonstrate how to get from an initial to a final state and accomplish the task goal. Those lessons store the selected actions, state transitions, and rewards attained. An agent can then repeat these lessons several times, similar to what it could do with its own experiences. The author applied the three techniques to eight reinforcement learning frameworks based on AHC and Q-learning: (i) AHCON (Connectionist AHC-learning); (ii) AHCON-R (AHCON using Experience Replay); (iii) AHCON-M (AHCON using Action Models); (iv) AHCON-T (AHCON using Experience Replay and Teaching); (v) QCON (Connectionist Q-learning); (vi) QCON-R (QCON with Experience Replay); (vii) QCON-M (QCON with Action Models); and (viii) QCON-T (QCON with Experience Replay and Teaching). These frameworks learn a policy or evaluation function approximated by neural networks and adjusted using TD and backpropagation.

According to Lin (1992), using ER in AHCON-R and QCON-R did not improve the final agent performance (compared to AHCON and QCON) but sped up convergence in all experiments. In turn, using action models in AHCON-M and QCON-M could in principle speed up convergence, but this did not happen in the experiments. When comparing AHCON-T and QCON-T with AHCON-R and QCON-R, there were no significant differences in the less complex environments. However, in the complex environments, AHCON-T and QCON-T were considerably faster. Therefore, the author states that the advantage of using Teaching becomes more significant as the task becomes more demanding, and he demonstrated the superiority of ER over action models when agents need to learn a model of the environment by themselves. The latter approach is superior only if a perfect action model is provided, which does not seem advantageous in problems with large state and action spaces, nondeterministic scenarios, and nontabular solutions, where there is no way to provide the agent with a perfect action model covering all possible situations.

Experience Replay has some limitations. Because it replays experiences through uniform sampling, samples generated under policies very different from the one the agent is currently learning can lead to underestimation of the evaluation and utility functions; this particularly affects methods that use neural networks, because adjusting the weights for a given state affects the entire model and, hence, many (or perhaps all) other states. Besides, the experience memory does not differentiate relevant experiences, due to uniform sampling, and overwrites many experiences due to the buffer size limitation. This points to the need for more sophisticated strategies that emphasize experiences capable of contributing more to agent learning, in the sense of what was proposed by Schaul et al. (2016). Another relevant problem is the size of the replay buffer. According to Mnih et al. (2015), all DQN-based methods (some of the most relevant are presented in Sect. 4) used a fixed replay memory size of 1M transitions. Recently, some research works have investigated the effects of small and large buffers (Zhang and Sutton 2017; Liu and Zou 2018). In turn, Neves et al. (2022) used a small dynamic memory to explore the replay of experiences and the dynamics of the transitions, reducing the number of experiences required for agent learning.

4 Experience replay in deep reinforcement learning

Looking at the history of reinforcement learning, some of the most recent and significant improvements have arisen from the possibility of approximating value functions from multidimensional data. At this point, adopting artificial neural networks was a promising proposition deeply investigated in the early related literature, including Lin (1992). However, the proposal of using Convolutional Neural Networks to approximate the action-value functions together with Experience Replay was a game-changing approach presented by Mnih et al. (2013). Since ER reduces nonstationarity and decorrelates the agent’s updates, contributing to stabilization when using deep neural networks, recent research has built on these two fundamental findings and brought many relevant discoveries, mainly in (but not limited to) the following aspects: (i) approximating value functions; (ii) reducing bias; (iii) composing better value functions; (iv) improving data efficiency; and (v) dealing with continuous-valued action spaces. Therefore, this section starts the discussion of ER in Deep RL, supported by some fundamental works and other new studies on these aspects.

4.1 Variations on convolutional neural networks in Q-learning and double Q-learning-based approaches

Deep Q-Network (DQN) (Mnih et al. 2013) and Double Deep Q-Network (DDQN) (Van Hasselt et al. 2016) are two relevant methods based on Q-Learning and Double Q-Learning with ER. These methods achieved state-of-the-art results and human-level performance in learning to play a set of Atari 2600 games emulated in the Arcade Learning Environment (ALE) (Bellemare et al. 2013; Machado et al. 2018), which poses complex challenges for RL agents, such as non-determinism, stochasticity, and exploration. To approximate the action-value functions, the authors used Convolutional Neural Networks (CNN) on representations of environment states obtained from the video-game frames, with no prior information about the games, no manually extracted features, and no knowledge of the internal state of the ALE emulator. Thus, agent learning occurred only from video inputs, reward signals, the set of possible actions, and the final state information of each game. The authors attributed their state-of-the-art results mainly to the ability of their CNN to represent the games’ states.

Mnih et al. (2015) improved DQN by changing how the CNN is used, as shown in Algorithm 1. Instead of using the same network (with the same parameters \(\Theta\)) to approximate both the action-value function \(Q(s, a, \Theta )\) and the target action-value function \(Q(s',a, \Theta ),\) the authors used independent sets of parameters \(\Theta\) and \(\Theta '\) for each network. Only the function \(Q(s, a, \Theta )\) has its parameters \(\Theta\) updated by backpropagation; the parameters \(\Theta '\) are updated directly (i.e., copied) from the values of \(\Theta\) at a certain frequency, remaining unchanged between two consecutive updates. Thus, only the forward pass is performed when the network with parameters \(\Theta '\) is used to predict the value of the target function. Specifically, at each time-step t, a transition (or experience) is defined by a tuple \(\tau _t = (s_t, a_t, r_t, s_{t+1})\), in which \(s_t\) is the current state, \(a_t\) is the action taken at that state, \(r_t\) is the reward received at t, and \(s_{t+1}\) is the state resulting after taking action \(a_t\). Recent experiences are stored to construct a replay buffer \(\mathcal{{D}} = \{\tau _1, \tau _2,\ldots , \tau _{N_\mathcal{{D}}}\}\), in which \(N_\mathcal{{D}}\) is the buffer size. Therefore, a CNN can be trained on samples \((s_t, a_t, r_t, s_{t+1}) \sim U(\mathcal{{D}})\), drawn uniformly at random from the pool of experiences, by iteratively minimizing the following loss function,

$$\begin{aligned} \small \hspace{-4pt}\mathcal{{L}}_{DQN}(\Theta _i)= {\mathbb {E}}_{(s_t, a_t, r_t, s_{t+1}) \sim U(\mathcal{{D}})} \left[ \left( r_t + \gamma \max _{a'} Q(s_{t+1}, a', \Theta ') - Q(s_t, a_t, \Theta _i) \right) ^2 \right] , \end{aligned}$$
(23)

in which \(\Theta _i\) are the parameters from the i-th iteration. Instead of using the same network, another one provides the target values \(Q(s_{t+1}, a', \Theta ')\) used to calculate the TD-error, decoupling any feedback that may result from using the same network to generate its own targets.

Algorithm 2 addresses the trade-off between exploration and exploitation through an \(\epsilon\)-Greedy strategy, which, with a given probability, selects an action from a uniform distribution over the set of possible actions (exploration); otherwise, it uses the CNN that approximates the Q(s, a) function to select the action that maximizes the estimated Q-value (exploitation). Generally, the value of the hyperparameter \(\epsilon\) decreases over time, causing the agent to explore a lot at first but gradually shift toward using more and more of the acquired knowledge.

Algorithm 1: DQN – deep Q-networks
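
The sketch below illustrates the update of Eq. 23 with an online network and a separate target network; a small fully connected network stands in for the CNN of Mnih et al. (2015), and the batch shapes, optimizer, and hyperparameters are illustrative.

```python
# A minimal sketch of the DQN update of Eq. 23 with an online and a target network.
import torch
import torch.nn as nn

n_obs, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())          # Theta' <- Theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def dqn_update(batch):
    s, a, r, s_next, done = batch                        # tensors sampled uniformly from D
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # Q(s_t, a_t, Theta_i)
    with torch.no_grad():                                # target network: forward pass only
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)          # squared TD-error of Eq. 23
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target():
    # every C steps, copy the online parameters into the target network
    target_net.load_state_dict(q_net.state_dict())

# toy batch just to show the expected shapes
batch = (torch.randn(32, n_obs), torch.randint(n_actions, (32,)),
         torch.randn(32), torch.randn(32, n_obs), torch.zeros(32))
dqn_update(batch)
```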

Algorithm 2: \(\epsilon\)-Greedy
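
A minimal sketch of this \(\epsilon\)-greedy selection with a linearly decaying \(\epsilon\) follows; the decay schedule and the dummy Q-function are illustrative.

```python
# Epsilon-greedy action selection with a linearly decaying epsilon.
import random

def epsilon_greedy(q_values_fn, state, n_actions, eps):
    if random.random() < eps:
        return random.randrange(n_actions)                       # explore: uniform over actions
    q_values = q_values_fn(state)
    return max(range(n_actions), key=lambda a: q_values[a])      # exploit: argmax_a Q(s, a)

def linear_eps(step, eps_start=1.0, eps_end=0.1, decay_steps=100_000):
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)              # decays from eps_start to eps_end

# usage with a dummy Q-function over 4 actions
action = epsilon_greedy(lambda s: [0.1, 0.5, 0.2, 0.0], state=None,
                        n_actions=4, eps=linear_eps(step=50_000))
```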

As in Q-Learning, one can note a bias in DQN. According to Van Hasselt et al. (2016), it can lead to overestimated action values. These overestimations would not be a problem if all action values were uniformly higher, which is unlikely to occur; it is more likely that overestimation is common during learning, mainly when action values are inaccurate. The real problem arises when the overestimation is not uniform and occurs more often for state-action pairs that lead to suboptimal policies. The authors showed the occurrence of overestimations in DQN and proposed DDQN based on Double Q-Learning. Since Double Q-Learning learns two value functions using two different sets of weights \(\Theta\) and \(\Theta ^{\prime }\), it is possible to compare Q-Learning with Double Q-Learning by rewriting the target value to untangle action selection and evaluation – Eqs. 24 and 25.

$$\begin{aligned} Y^{Q}_{t}= & R_{t+1} + \gamma Q(S_{t+1},argmax_{a}Q(S_{t+1},a,\Theta _{t}),\Theta _{t} ) \end{aligned}$$
(24)
$$\begin{aligned} Y^{DoubleQ}_{t}= & R_{t+1} + \gamma Q(S_{t+1},argmax_{a}Q(S_{t+1},a,\Theta _{t}),\Theta ^{\prime }_{t} ) \end{aligned}$$
(25)

The authors demonstrated that DDQN (see Algorithm 3) reduces bias by using the formulation of Double Q-Learning, decomposing the \(\max\) operation in the target into action selection and action evaluation, which improves the action-value function updates and makes them more stable by diminishing the overestimation of the Q-values. The target value changes from Eq. 26 to Eq. 27. The update of the target network is performed in the same way as in DQN, periodically copying the updated weights from the online network, which approximates the evaluation function.

$$\begin{aligned} Y_{t}= & r_t + \gamma \max _{a'} Q(s_{t+1}, a', \Theta ') \end{aligned}$$
(26)
$$\begin{aligned} Y_{t}= & r_t + \gamma Q(s_{t+1},argmax_{a}Q(s_{t+1},a,\Theta ), \Theta ') \end{aligned}$$
(27)

Algorithm 3: DDQN – double deep Q-networks
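
The following sketch shows only the DDQN target computation of Eq. 27, in which the online network selects the action and the target network evaluates it; it assumes the hypothetical q_net and target_net defined in the DQN sketch above.

```python
# DDQN target of Eq. 27: online network selects, target network evaluates.
import torch

def ddqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)          # argmax_a Q(s', a, Theta)
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)    # Q(s', a*, Theta')
        return r + gamma * (1 - done) * q_eval                      # Eq. 27
```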

Some research works investigated feeding sequences of video frames to the neural network that approximates the action-value function in DQN-based methods, to improve the perception of movement, speed up the trial-and-error process, and deal with problems in which agents only observe a reward signal after long sequences of decisions. For Hausknecht and Stone (2015), mapping states to actions based only on the four previous game frames (stacked in a pre-processing step) prevents the DQN agent from achieving the best performance in games that require remembering events far in the past, spanning a large number of frames, because in those games the future states and rewards depend on several previous states. Therefore, the authors proposed using a Long Short-Term Memory (LSTM) in place of the first fully connected layer, just after the series of convolutional layers in the original DQN architecture, to make better use of the limited history. The authors demonstrate a trade-off between using a non-recurrent network with a long history of observations and a recurrent network with just one frame at each iteration step. They stated that a recurrent network is a viable approach for dealing with observations from multiple states, although it presents no systematic benefits compared to stacking these observations in the input layer of a plain CNN. Moreno-Vera (2019) proposed a similar approach using DDQN instead of DQN.

Wang et al. (2016) proposed a new deep neural network architecture called the Dueling Network, in place of the single-stream architectures commonly used (e.g., convolutional layers followed by fully connected layers), to improve model-free reinforcement learning methods. This architecture can generalize learning across actions without changes to the underlying reinforcement learning algorithm. It uses two estimators in the same network, one for the state-value function \(V(s,\theta ,\beta )\) and another for the so-called state-dependent action advantage function \(A(s,a,\theta ,\alpha )\), defined by two streams of fully connected layers (following the convolutional layers) whose outputs are the (scalar) state value and a vector of advantages, one for each action. Equation 28 combines these two outputs to produce the final Q-value estimates, in which \(\alpha\) and \(\beta\) are the parameters of the two streams of fully connected layers, and \(\theta\) represents the parameters of the convolutional layers.

$$\begin{aligned} \small Q(s, a, \theta , \alpha , \beta )=V(s, \theta , \beta )+ \left( A(s, a, \theta , \alpha )-\frac{1}{|\mathcal {A} |} \sum _{a^{\prime }} A\left( s, a^{\prime }, \theta , \alpha \right) \right) \end{aligned}$$
(28)

As the output is also Q-value estimates for each action in the input states, the dueling network architecture can replace the original neural networks in other algorithms such as DQN and DDQN, with adaptation only regarding backpropagation. The authors demonstrated improved experimental results using uniform and Prioritized Experience Replay (PER) (Schaul et al. 2016).
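
The sketch below illustrates the combination of the value and advantage streams in Eq. 28, with a small fully connected feature extractor standing in for the convolutional layers; layer sizes are illustrative.

```python
# Dueling head of Eq. 28: Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')).
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, n_obs=8, n_actions=4, hidden=64):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(n_obs, hidden), nn.ReLU())   # theta
        self.value = nn.Linear(hidden, 1)                                    # V(s; theta, beta)
        self.advantage = nn.Linear(hidden, n_actions)                        # A(s, a; theta, alpha)

    def forward(self, s):
        h = self.features(s)
        v = self.value(h)                              # shape (batch, 1)
        a = self.advantage(h)                          # shape (batch, n_actions)
        # subtract the mean advantage so V and A are identifiable (Eq. 28)
        return v + a - a.mean(dim=1, keepdim=True)

q = DuelingQNet()(torch.randn(32, 8))                  # Q-value estimates, shape (32, 4)
```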

4.2 Dealing with continuous-valued action spaces in off-policy RL

The DQN-based methods, such as DDQN and Dueling Networks, either with the original ER or with PER, achieved state-of-the-art results (at the time of their respective publications) in learning to act directly from high-dimensional states, interacting with nondeterministic environments in a stochastic way, and approximating the optimal policy over discrete and low-dimensional action spaces. However, many interesting problems, such as physical control tasks, have continuous (real-valued) and high-dimensional action spaces (\(\mathcal {A}=\mathbb {R}^N\)). In these cases, for the DQN to find the action that maximizes the Q-value estimate, an iterative optimization process would be necessary at each step of the agent. Therefore, Lillicrap et al. (2016) built on the Deterministic Policy Gradient (DPG) algorithm (Silver et al. 2014) and on DQN to propose an actor-critic, model-free algorithm called Deep Deterministic Policy Gradient (DDPG) that uses deep neural networks with ER and can learn over continuous action spaces. According to the authors, DDPG can find policies whose performance is competitive with (sometimes better than) those found by a planning algorithm with full access to the dynamics of challenging physical control problems that involve complex multi-joint movements, cartesian coordinates, unstable and rich contact dynamics, and gait behavior. They evaluated their agent in learning action policies from video-frame pixels and physical control data (such as joint angles), using the same hyperparameters and network architecture across different challenges in simulated physical environments built on MuJoCo (Todorov et al. 2012), a physics engine originally proposed for model-based control.

From the derivations of the Bellman equation presented in Sect. 2 to define the action-value function in Eq. 8, one can note that, if the target policy is deterministic, it can be described as a function \(\mu :\mathcal {S}\rightarrow \mathcal {A}\), which removes the inner expectation in the target, as described by Lillicrap et al. (2016), changing Eq. 29 into Eq. 30.

$$\begin{aligned} Q^\pi (s_t,a_t)= & \mathbb {E}_{r_t,s_{t+1}}[r(s_t,a_t)+\gamma \mathbb {E}_{a_{t+1}\sim \pi }[Q^\pi (s_{t+1},a_{t+1})]] \end{aligned}$$
(29)
$$\begin{aligned} Q^\mu (s_t,a_t)= & \mathbb {E}_{r_t,s_{t+1}}[r(s_t,a_t)+\gamma Q^\mu (s_{t+1},\mu (s_{t+1}))] \end{aligned}$$
(30)

As in MDPs, the discounted sum of future rewards R depends on the policy \(\pi\); the authors denote the induced (discounted) distribution over visited states as \(\rho ^{\pi }\). Therefore, it is possible to learn the function \(Q^{\mu }\) off-policy using transitions obtained from a different stochastic behavior policy, referred to as \(\beta\), with state distribution \(\rho ^{\beta }\). Thus, an approximator of the Q-value function parameterized by \(\Theta ^Q\) can be optimized by minimizing the loss in Eq. 31.

$$\begin{aligned} L(\Theta ^Q) = \mathbb {E}_{s_t\sim \rho ^{\beta },a_t\sim \beta }\left[ \left( Q(s_t,a_t\;|\;\Theta ^Q) - \left( r_t+\gamma Q(s_{t+1},\mu (s_{t+1})\;|\;\Theta ^Q)\right) \right) ^2\right] \end{aligned}$$
(31)

The DPG algorithm applies a parameterized function \(\mu (s\;|\;\theta ^{\mu })\) (the actor) to define the current policy by deterministically mapping states to actions and updates its parameters using the policy gradient (i.e., the gradient of the policy’s performance) (Silver et al. 2014). This update applies the chain rule to the expected return from the start distribution J with respect to the actor parameters, as in Eqs. 32 and 33. In turn, it learns the Q-value function Q(s, a) (the critic) using the Bellman equation, as in Q-Learning (Lillicrap et al. 2016).

$$\begin{aligned} \nabla _{\theta ^{\mu }}J&\approx \mathbb {E}_{s_t\sim \rho ^{\beta }}\left[ \nabla _{\theta ^{\mu }} Q(s,a\;|\;\theta ^Q)\big \vert _{s=s_t,\,a=\mu (s_t\;|\;\theta ^{\mu })}\right]&\end{aligned}$$
(32)
$$\begin{aligned}&= \mathbb {E}_{s_t\sim \rho ^{\beta }}\left[ \nabla _{a}Q(s,a\;|\;\theta ^Q)\big \vert _{s=s_t,\,a=\mu (s_t)}\, \nabla _{\theta ^{\mu }}\mu (s\;|\;\theta ^\mu )\big \vert _{s=s_t}\right] \end{aligned}$$
(33)

DDPG (see Algorithm 4) modifies DPG to incorporate the contribution of DQN in approximating the Q-value function from the high-dimensional state space and uses the policy gradient to deal with high-dimensional and continuous action spaces. It also uses a replay buffer with uniform sampling. One change is in the updating of the target function. Instead of copying the weights directly from the online to the target neural network (as in DQN), the authors create copies of the critic and actor networks, \(Q'(s,a\;|\;\theta ^{Q'})\) and \(\mu '(s\;|\;\theta ^{\mu '})\), and use them to estimate the target values, then update these target networks by slowly (for stability) tracking the learned networks, making \(\theta ' \leftarrow \tau \theta + (1 - \tau )\theta '\), with \(\tau \ll 1\). This way, the authors obtained stable targets \(y_i\) to train the critic network consistently. To deal with learning from low-dimensional physical feature vectors, whose components may have different units and scales, such as positions and velocities, the authors applied batch normalization (Ioffe and Szegedy 2015) to the state input, to all layers of the \(\mu\) network, and to the layers of the Q network before its action input. Because it is off-policy, DDPG can deal with exploration (a difficult problem in continuous action spaces) independently of the learning algorithm. While DQN uses an \(\epsilon\)-greedy approach (see Algorithm 2), Lillicrap et al. (2016) use an Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein 1930) to add a noise \(\mathcal {N}\) to the actor policy and generate temporally correlated exploration, where \(\mu '(s_t) = \mu (s_t\;|\;\theta _{t}^{\mu })+ \mathcal {N}\).

Algorithm 4: DDPG – deep deterministic policy gradient
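
The sketch below illustrates the DDPG update described above: a critic regression toward slowly moving targets, the deterministic policy gradient step for the actor, and the soft target updates \(\theta ' \leftarrow \tau \theta + (1 - \tau )\theta '\). The small fully connected networks, batch, and hyperparameters are illustrative, and exploration noise is omitted.

```python
# A minimal DDPG update: critic regression, actor policy gradient, soft target updates.
import copy
import torch
import torch.nn as nn

n_obs, n_act, gamma, tau = 8, 2, 0.99, 0.005
actor = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_act), nn.Tanh())
critic = nn.Sequential(nn.Linear(n_obs + n_act, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, done):
    # critic: minimize the squared TD-error against slowly moving targets
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * (1 - done) * critic_target(torch.cat([s_next, a_next], 1)).squeeze(1)
    q = critic(torch.cat([s, a], 1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor: deterministic policy gradient, ascend Q(s, mu(s))
    actor_loss = -critic(torch.cat([s, actor(s)], 1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # soft target updates: theta' <- tau * theta + (1 - tau) * theta'
    for net, net_t in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

batch = (torch.randn(32, n_obs), torch.rand(32, n_act), torch.randn(32),
         torch.randn(32, n_obs), torch.zeros(32))
ddpg_update(*batch)
```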

4.3 Improving sample data efficiency in experience replay

According to Schaul et al. (2016), an agent can learn more efficiently from some experiences than from others, and some experiences may become more relevant as the agent approaches an optimal policy. Moreover, replaying experiences drawn by uniform sampling does not consider their relevance for agent learning and usually repeats them at the same frequency with which they occurred. Given this, the authors investigated the effects of prioritizing experiences, with studies in a purpose-built environment that presents exploration challenges with rare rewards, resulting in the Prioritized Experience Replay (PER) method. They initially investigated the effects of prioritization in reducing the number of update steps a Q-Learning agent needs to learn the Q-function, comparing the results using: (i) uniform sampling; (ii) an oracle that achieves the best results; and (iii) a greedy sampling strategy. The greedy strategy stores the last TD-error value along with each transition in the replay memory and replays the ones with the highest absolute TD-error to update the Q-function. They verified that it reduced the number of update steps compared to uniform sampling but presented several issues. Because it only updates the TD-error of the replayed transitions, it may not replay transitions initially associated with low TD-error values for a long time, or until they are discarded because of the size-constrained replay buffer. Moreover, replaying experiences with high and slowly decreasing TD-errors often causes a loss of diversity, which may lead the model to overfit, besides being sensitive to noise spikes (e.g., when the rewards are stochastic). Therefore, they proposed a stochastic sampling method that combines greedy prioritization and uniform random sampling by defining the sampling probability based on a transition’s priority value. According to Eq. 34, the probability of sampling a transition j is

$$\begin{aligned} P(j)=\frac{p_{j}^{\alpha }}{\sum _{i} p_{i}^{\alpha }}, \end{aligned}$$
(34)

where \(p_j > 0\) is the priority of transition j and \(\alpha\) defines how much prioritization is used (\(\alpha = 0\) corresponds to uniform sampling). Nevertheless, prioritizing experiences introduces bias because it changes the probability distribution on which the stochastic updates depend. Therefore, the authors proposed correcting this bias with importance-sampling weights \(w_j\) (applied in the Q-function update), given by Eq. 35,

$$\begin{aligned} w_{j}=\left( \frac{1}{N} \times \frac{1}{P(j)}\right) ^{\beta } \end{aligned}$$
(35)

in which N represents the size of the replay buffer; the weight fully compensates for the nonuniform probabilities P(j) when \(\beta =1\). Based on the hypothesis that small amounts of bias can be ignored early on, since the bias matters most as convergence approaches, the authors anneal the amount of importance-sampling correction over time, increasing \(\beta\) linearly from an initial value so that \(\beta =1\) is reached only at the end of training. Finally, Schaul et al. (2016) combined prioritized replay, stochastic sampling with priority values, and importance sampling to define the PER method (see Algorithm 5). They replaced the uniform sampling in DDQN, achieving new state-of-the-art results in learning to play Atari 2600 games in ALE.

Algorithm 5: DDQN with PER using proportional prioritization
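
As a minimal illustration of proportional prioritization, the sketch below implements the sampling probability of Eq. 34 and the importance-sampling weights of Eq. 35 over a simple list-based buffer; an efficient implementation would typically use a sum-tree, and the priority definition (absolute TD-error plus a small constant) and hyperparameters are illustrative.

```python
# Proportional prioritized sampling (Eqs. 34 and 35) over a simple buffer.
import numpy as np

class ProportionalPER:
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def store(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:          # overwrite the oldest entry
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                                         # Eq. 34
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)          # Eq. 35
        weights /= weights.max()                                    # normalize for stability
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        for i, d in zip(idx, td_errors):                            # refresh replayed transitions
            self.priorities[i] = (abs(d) + self.eps) ** self.alpha

buffer = ProportionalPER(capacity=100)
for t in range(100):
    buffer.store(("s", "a", 0.0, "s_next"), td_error=np.random.rand())
idx, batch, w = buffer.sample(8, beta=0.4)
```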

According to Schaul et al. (2016), the use of a replay buffer presents two main challenges: (i) selecting which experiences to store; and (ii) selecting which ones to replay. They addressed the second when they proposed PER, assuming that the content of the memory was beyond their control. Novati and Koumoutsakos (2019), Zha et al. (2019), and Sun et al. (2020) also dealt with the second case, seeking to make ER more optimized and data-efficient by studying how to sample transitions to improve the current learning policy. In turn, Neves et al. (2022) approached the first case, investigating how to store transitions in a transitions memory, improving data efficiency, but mainly seeking to exploit rare and expensive experiences. For Novati and Koumoutsakos (2019), the accuracy of the updates can deteriorate when the policy diverges from past behaviors, which can undermine the performance of ER. Instead of tuning hyperparameters to slow down policy changes, they actively reinforce the similarity between the current policy \(\pi\) and the past behaviors \(\mu\) used to compute updates, with an approach called Remember and Forget Experience Replay (ReF-ER). It skips gradients computed from experiences that are too unlikely under the current policy and regulates policy changes within a trust region of the replayed behaviors. Its main objective is to control the similarity between \(\pi\) and \(\mu\), classifying experiences as “near-policy” or “far-policy” based on the ratio \(\rho\) between the probability of selecting the associated action under \(\pi\) and under \(\mu\). ReF-ER then limits the fraction of far-policy samples in the replay memory and computes gradient estimates only from near-policy experiences. The authors demonstrated that their approach can be applied to any off-policy method with parameterized policies (i.e., using a Deep Neural Network – DNN) and that it allows better stability and agent performance (compared to uniform sampling) in the main classes of methods for continuous action spaces based on DPG (i.e., DDPG), Q-learning (i.e., NAF in Gu et al. (2016)), and off-policy Policy Gradients (off-PG) (Degris et al. 2012).
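
A minimal sketch of the near-/far-policy classification at the core of ReF-ER follows; it assumes each stored experience keeps the behavior probability \(\mu(a\,\vert\,s)\), and the trust-region hyperparameter c and the example probabilities are illustrative assumptions rather than the authors' exact formulation.

```python
# Classify replayed experiences as near- or far-policy from the importance ratio.
import numpy as np

def is_near_policy(pi_prob, mu_prob, c=4.0):
    rho = pi_prob / (mu_prob + 1e-12)      # ratio between current and behavior policy probabilities
    return (1.0 / c) < rho < c             # inside the trust region -> near-policy

probs_pi = np.array([0.30, 0.02, 0.25])    # illustrative probabilities under the current policy
probs_mu = np.array([0.25, 0.40, 0.05])    # ... and under the policies that generated the data
mask = [is_near_policy(p, m) for p, m in zip(probs_pi, probs_mu)]
print(mask)                                # [True, False, False]: only the first contributes gradients
```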

Zha et al. (2019) proposed the Experience Replay Optimization (ERO) framework, which aims to optimize the replay strategy by learning a replay policy (instead of applying a heuristic or rule-based strategy); the main challenge is dealing with the continuous, noisy, and unstable (regarding the rewards) updating of a large replay memory (usually in the tens of millions of transitions). Its objective is to learn to sample experiences that maximize the expected cumulative reward. While the agent learns a policy \(\pi : \mathcal {S}\rightarrow \mathcal {A}\), ERO learns a policy \(\phi : \mathcal {D}\rightarrow \mathcal {B_{i}}\), where \(\mathcal {D}\) is the replay buffer and \(\mathcal {B_{i}}\) is a batch of transitions sampled from \(\mathcal {D}\) at time step i. \(\phi\) outputs a boolean vector to guide the subset sampling, indirectly teaching the agent by defining which subset it should use to update its value functions. Then, ERO adjusts \(\phi\) according to the return from the environment as a measure of the agent’s performance improvement. The authors evaluated their approach by applying it to train a DDPG agent on eight continuous control tasks from the OpenAI Gym environment. They concluded that their proposal is promising because it could find more “usable” experiences for off-policy agents using ER in different tasks.

Sun et al. (2020) proposed Attentive Experience Replay (AER) to prioritize, at sampling time, transitions containing states more frequently observed by the current policy, based on the idea that some states in past experiences may become rarely revisited as the policy is continually updated and may not contribute to, or may even harm, the performance of the current policy. The authors use the similarity between past transition states and currently visited states as a measure of visitation frequency and as the prioritization criterion. In their view, some experiences in the replay buffer might become irrelevant to the current policy, and others may contain states that the current policy would never visit; optimizing over such states might not improve the overall performance of the current policy and can undermine performance on frequently visited states. The authors evaluated AER within the off-policy algorithms DQN, DDPG, Soft Actor-Critic (SAC) (Haarnoja et al. 2018), and Twin Delayed Deep Deterministic Policy Gradient (TD3) (Fujimoto et al. 2018), comparing against uniform sampling and PER on tasks from the OpenAI Gym ecosystem (Brockman et al. 2016).

Neves et al. (2022) proposed a method named COMPact Experience Replay (COMPER) to improve the model of the experience memory and make ER feasible (and more efficient) with smaller amounts of data. The authors demonstrated that it is possible to produce sets of similar transitions and exploit them to build a reduced transitions memory, performing successive updates of their Q-values and learning their dynamics through a Long Short-Term Memory (LSTM) network. They also used this same LSTM network to approximate the target value in TD-learning. According to the authors, this increases the likelihood of a rare transition being observed, compared to sampling from a large replay buffer, and makes the updates of the value function more effective. The authors presented a complete analysis of the memories' behavior, along with detailed results for 100,000 frames and about 25,000 iterations with a small experience memory on eight challenging Atari 2600 games in the Arcade Learning Environment (ALE), demonstrating that COMPER can approximate a good policy from a small number of frame observations using a compact memory and learning the dynamics of the similar transitions' sets with a recurrent neural network.

COMPER (see Algorithm 6) uses ER and TD-learning to update the Q-value function Q(s, a). However, it does not simply construct a replay buffer. Instead, COMPER samples transitions from a much more compact structure named Reduced Transition Memory (\(\mathcal{{RTM}}\)). To achieve that, COMPER first stores the transitions together with estimated Q-values in a structure named Transition Memory (\(\mathcal{{TM}}\)), which is similar to a traditional replay buffer except for the presence of the Q-value and the identification and indexing of Similar Transitions Sets (\(\mathcal{{ST}}\)). After that, the similarities between the transitions stored in \(\mathcal{{TM}}\) can be exploited to generate a more compact version of it, the \(\mathcal{{RTM}}\). Then, the transitions \((s_t, a_t, r_t, s_{t+1}) \sim U(\mathcal{{RTM}})\) are drawn uniformly from \(\mathcal{{RTM}}\) and used to minimize the following loss function,

$$\begin{aligned} \mathcal{{L}}_{COMPER}(\Theta _i)= {\mathbb {E}}_{\tau _t=(s_t, a_t, r_t, s_{t+1}) \sim U(\mathcal{{RTM}})} \left[ \left( r_t + \gamma \, QT(\tau _t, \Omega ) - Q(s_t, a_t, \Theta _i) \right) ^2 \right] , \end{aligned}$$
(36)

in which \(Q(s_t, a_t, \Theta _i)\) is a Q-function approximated by a CNN parameterized by \(\Theta _i\) at the i-th iteration, and \(QT(\tau _t, \Omega )\) is a Q-target function approximated by an LSTM and parameterized by \(\Omega\). This function provides the target value and is updated in a supervised way from the \(\mathcal{{ST}}\)s stored in \(\mathcal{T}\mathcal{M}\). Thus, this LSTM is also used to build a model that generates the compact structure of \(\mathcal{{RTM}}\) from \(\mathcal{T}\mathcal{M}\) while seeking to learn the dynamics of the \(\mathcal{{ST}}\)s to provide better target values at the next agent update step.
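A minimal sketch of the loss in Eq. 36, assuming the CNN Q-network and the LSTM target network are available as plain callables (the names q_net and qt_lstm are ours, not the authors'):

```python
import numpy as np

def comper_loss(batch, q_net, qt_lstm, gamma=0.99):
    """Illustrative computation of the COMPER TD target and squared error.

    batch   : transitions (s, a, r, s_next) sampled uniformly from the RTM.
    q_net   : callable Q(s, a) -> scalar estimate (parameters Theta).
    qt_lstm : callable QT(tau) -> scalar target produced by the LSTM trained
              on the similar-transition sets stored in the TM.
    """
    errors = []
    for (s, a, r, s_next) in batch:
        target = r + gamma * qt_lstm((s, a, r, s_next))   # LSTM-based TD target
        errors.append((target - q_net(s, a)) ** 2)        # squared TD error
    return float(np.mean(errors))
```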

Algorithm 6: COMPER – COMPact experience replay

At each training time-step t, the authors define a transition by a tuple \(\tau _t = (s_t, a_t, r_t, s_{t+1}).\) Two transitions \(\tau _{t_1} = (s_{t_1}, a_{t_1}, r_{t_1}, s_{{t_1}+1})\) and \(\tau _{t_2} = (s_{t_2}, a_{t_2}, r_{t_2}, s_{{t_2}+1}), t_1 \ne t_2\), are similar (\(\tau _{t_1} \approx \tau _{t_2}\)) when the distance (e.g., Euclidean distance) between \(\tau _{t_1}\) and \(\tau _{t_2}\) does not exceed a threshold, that is, \(\mathcal{{D}}(\tau _{t_1}, \tau _{t_2}) \le \delta\), in which \(\delta\) is a distance (or similarity) threshold. The N transitions that occurred up to a given time instant are stored in \(\mathcal{{TM}}\) and can be identified as subsets of similar transitions \(\mathcal{{ST}}\) when the similarity condition is satisfied. In addition, they are stored throughout subsequent agent training episodes and are identified by a unique index. Therefore, the authors define \(\mathcal{{TM}}=\left\{ [T^i, \mathcal{{ST}}_i]\,|\,i=1,2,3,\ldots , N_{ST} \right\}\), in which \(N_{ST}\) is the total number of distinct subsets of similar transitions, \(T^i\) is a unique numbered index and \(\mathcal{{ST}}_i\) represents a set of similar transitions and their Q-values. Thus,

$$\begin{aligned} \mathcal{{ST}}_i = \left\{ \left[ \tau _{i(1)}, Q_{i(k)}\right] \;|\; 1 \le k \le N^i_{ST} \right\} \end{aligned}$$
(37)

in which \(N^i_{ST}\) represents the total number of similar transitions in the set \(\mathcal{{ST}}_i\). Thus, \(\tau _{i(1)}\) corresponds to some transition \(\tau _{t_j}, j \in \{1, \ldots, N^i_{ST}\}\), and is the representative transition of the similar transitions set \(\mathcal{{ST}}_i\) (e.g., the first one), and \(Q_{i(k)}\) is the Q-value corresponding to some transition \(\tau _{t_j}, j \in \{1, \ldots, N^i_{ST}\}\), such that \(\tau _{i(1)} \in \mathcal{{ST}}_i\) and \(\tau _{i(1)} \approx \tau _{i(k)}, 1 \le k \le N^i_{ST}\). Therefore, \(\mathcal{{TM}}\) can be seen as a set of \(\mathcal{{ST}}\)s. A single representative transition for each \(\mathcal{{ST}}\) can be generated, together with the prediction of its next Q-value, from an explicit model of the \(\mathcal{{ST}}\) using the LSTM. This way, from \(\mathcal{T}\mathcal{M}\), one can produce an \(\mathcal{{RTM}}\) in which \(\tau '_i\) is the transition that represents all the similar transitions identified so far in \(\mathcal{{ST}}_i\), so that \(\mathcal{{RTM}} = \left\{ [\tau '_i]\,|\,i=1,2,3,\ldots , N_{ST} \right\}\). Unlike \(\mathcal{{TM}}\), \(\mathcal{{RTM}}\) does not keep track of sets of similar transitions, since each \(\tau '_i\) is unique and represents all the transitions in a given \(\mathcal{{ST}}_i\). According to the authors, this gives the transitions stored in \(\mathcal{{RTM}}\) the chance of having their Q-values re-estimated. Besides, sampling from \(\mathcal{{RTM}}\) increases the chances of selecting rare and very informative transitions more frequently, while also helping to increase diversity (because of the variability within each sample).
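A rough sketch of grouping transitions into similar-transition sets by the distance threshold \(\delta\), assuming each transition tuple is flattened into a feature vector; the linear scan is for illustration only and ignores the indexing scheme the authors describe:

```python
import numpy as np

def build_tm_and_rtm(transitions, q_values, delta):
    """transitions : list of 1-D numpy arrays (flattened transition tuples).
    q_values       : list of Q-value estimates, aligned with transitions.
    Each ST keeps one representative transition and the Q-values of all
    transitions found similar to it; the RTM stores one representative per ST.
    """
    tm = []  # list of dicts: {'rep': vector, 'q_values': [...]}
    for tau, q in zip(transitions, q_values):
        for st in tm:
            if np.linalg.norm(tau - st["rep"]) <= delta:  # D(tau, rep) <= delta
                st["q_values"].append(q)
                break
        else:
            tm.append({"rep": tau, "q_values": [q]})      # start a new ST
    rtm = [st["rep"] for st in tm]                        # compact memory
    return tm, rtm
```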

In Algorithm 7, one can observe that COMPER slightly modifies the \(\epsilon\)-greedy algorithm to return the estimate of the Q-value together with the action that maximizes it.

Algorithm 7: COMPER \(\epsilon\)-Greedy

One of the main contributions of Experience Replay (ER) is to reduce nonstationarity and decorrelate the agent's updates, contributing to stabilization when deep neural networks are used to approximate the value functions. However, the way it stores and samples the agent's experiences in the replay memory limits its use to off-policy reinforcement learning algorithms. In place of ER, Mnih et al. (2016) proposed using asynchronous gradient descent to optimize deep neural networks and to train several agents in parallel on multiple instances of the environment. According to the authors, this parallelism also decorrelates the agents' data because, at each time step, the parallel agents are likely to be experiencing a variety of different states and can explicitly use different exploration policies to maximize diversity. Moreover, by running different exploration policies in multiple threads, the overall changes applied by multiple actor-learners performing online updates in parallel are likely to be less correlated in time than those of a single online agent, fulfilling the stabilizing role played by ER. The authors demonstrated that their approach can be used with off-policy and on-policy algorithms by presenting multithreaded asynchronous variants of Q-learning, Sarsa, and Advantage Actor-Critic methods. Their best-evaluated algorithm, called Asynchronous Advantage Actor-Critic (A3C), surpassed the state of the art (at publication time) on the Atari 2600 domain in ALE and reduced training time roughly linearly in the number of parallel actor-learners. The authors also evaluated A3C on the MuJoCo physics simulator domain (Todorov et al. 2012).

4.4 Combining benefits in ensemble methods

Many relevant improvements to DQN-based methods address different aspects. DDQN addresses the overestimation bias of Q-learning and, consequently, of DQN, while PER improves data efficiency in experience replaying. The Dueling Network improves generalization across actions by representing state values and action advantages separately. A3C shifts the bias-variance trade-off by learning from multistep bootstrap targets and helps propagate newly observed rewards faster to earlier visited states. Distributional Q-learning learns a categorical distribution of discounted returns instead of estimating the mean. Noisy DQN uses stochastic network layers for exploration. Given this, Hessel et al. (2018) investigated how to combine these different but complementary ideas, together with ER, into an ensemble approach called Rainbow, which achieved state-of-the-art results on 57 Atari 2600 games in ALE (Bellemare et al. 2013) in terms of data efficiency and final performance.

The authors adapted the PER strategy to use the KL loss of Distributional Q-learning, replaced the one-step distributional loss with a multistep variant, and defined the target distribution by contracting the value distribution in \(S_{t+n}\) and shifting it by the truncated n-step discounted return. They combined the multistep distributional loss with Double Q-learning, using the greedy strategy to select the action in \(S_{t+n}\) with the online network and evaluate it with the target network. They also adapted the dueling network architecture for use with return distributions, so that the output of a shared state representation layer is fed into a value stream and an advantage stream designed to output distributional values, which are combined as in Dueling Networks and then passed through a softmax layer to obtain the normalized parametric distributions used to estimate the return distributions. Finally, they replaced all linear layers with equivalent noisy layers and used factorized Gaussian noise (Fortunato et al. 2018) to reduce the number of independent noise variables. An open-source variation of Rainbow is available in the framework for RL agent development called Dopamine (Castro et al. 2018), which differs from the original Rainbow (Hessel et al. 2018) by not including DDQN, dueling heads, or noisy networks. It uses n-step returns, which Fedus et al. (2020) identified as a critical element for improving agent performance when using a larger replay buffer (i.e., 10 million experiences instead of the classical limit of 1 million). The n-step return updates the Q-value function from an n-step target value rather than a one-step one, so that the target side of Q-learning changes from Eq. 38 to Eq. 39 (a sketch of the n-step target follows the equations). The authors interpret it as an interpolation between the Monte Carlo (MC) target \(\sum _{k=0}^{T} \gamma ^k r_{t+k}\) (a discussion can be found in Sutton and Barto (2018)) and single-step TD-learning, balancing the low bias but high variance of MC targets against the low variance but high bias of TD(0) (see Sect. 2).

$$\begin{aligned} & r +\gamma \max _{a}Q(s_{t+1},a)- Q(s_t,a_t) \end{aligned}$$
(38)
$$\begin{aligned} & \quad \sum _{k=0}^{n-1} \gamma ^k r_{t+k}+\gamma ^n \max _a Q\left( s_{t+n}, a\right) \end{aligned}$$
(39)
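As an illustration of the n-step target in Eq. 39, the sketch below computes it from a list of observed rewards and a bootstrap value; the function name and the three-step example values are ours, chosen only for demonstration:

```python
def n_step_target(rewards, q_next_max, gamma=0.99):
    """n-step TD target of Eq. 39: the sum of the first n discounted rewards
    plus the discounted bootstrap value max_a Q(s_{t+n}, a), passed in here
    as q_next_max."""
    n = len(rewards)
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return g + (gamma ** n) * q_next_max


# e.g., n = 3 rewards observed after s_t, bootstrapping from Q at s_{t+3}
target = n_step_target([1.0, 0.0, 0.5], q_next_max=2.0, gamma=0.99)
```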

Kaiser et al. (2019) presented a model-based deep reinforcement learning algorithm with a video prediction model named SimPLe, which performed well after just 102,400 interactions (corresponding to 409,600 frames on ALE and about 800,000 samples from the video prediction model), and compared their results with the ones obtained by Rainbow (Hessel et al. 2018). They aimed to show that planning with a parametric model allows for data-efficient learning on several Atari video games. In that sense, van Hasselt et al. (2019) proposed a broad discussion about model-based algorithms and experience replay, pointing out their commonalities and differences, when to expect benefits from either approach, and how to interpret prior works in this context. They set up experiments in a way comparable to Kaiser et al. (2019) and demonstrated that, in a like-for-like comparison, Rainbow outperformed the scores of the model-based agent with less experience and computation: Rainbow used a total of 3.2 million replayed samples, while SimPLe used 15.2 million. Łukasz Kaiser et al. (2020) presented their final published paper comparing SimPLe and Rainbow on the number of iterations needed to achieve the best results. SimPLe achieved the best game scores on half of the game set. However, the authors state that one of SimPLe's limitations is that its final scores are, on the whole, lower than those of the best state-of-the-art model-free methods.

5 Challenges and trends in experience replay

The challenges we found in the literature, from the early propositions to the most recent research, allowed us to identify a set of what we could consider essential problems, such as the ones Experience Replay was proposed to solve. However, there are also classes of relatively recent issues arising from approaches previously proposed for open problems, such as the bias potentially introduced by PER or the catastrophic forgetting suffered by many methods that otherwise benefit from ER. This section identifies some relevant general problems in the Experience Replay domain (despite the many benefits of each approach in the literature), as presented in Table 1, and selects some of them for a more in-depth discussion of the literature.

Table 1 Main challenges in reinforcement learning with experience replay

5.1 Replay buffer size

Relevant research works have sought to understand the effects of the replay buffer size (either small or large) on the performance of reinforcement learning agents that use ER. For example, since Mnih et al. (2015), all DQN-based methods use a fixed replay memory size of one million transitions, and variations in the buffer size are still understudied in this class of methods (Liu and Zou 2018). Zhang and Sutton (2017) presented an empirical study on ER, demonstrating that a large replay buffer can harm agent performance and that its size is a very important hyperparameter neglected in the literature. They proposed a method to minimize the negative influence of a large replay buffer called Combined Experience Replay (CER), which consists of adding the last transition to the sampled batch before using it in agent training (see the sketch below). They hypothesized a trade-off between data quality and data correlation: smaller replay buffers make the data fresher but highly temporally correlated, whereas neural networks often need independent and identically distributed (i.i.d.) data, and data sampled from larger replay buffers tends to be uncorrelated but outdated. A full replay buffer adopting a FIFO strategy (i.e., working as a queue) will impact agent learning. However, according to Neves et al. (2022), if we assume that different transitions from many stochastic episodes carried out in a nondeterministic environment will be stored and sampled many times from a smaller replay buffer (but one that is not explicitly size-limited), this buffer tends to become less correlated over time as its size increases.
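A minimal sketch of the CER sampling rule described above, assuming the buffer is an ordered list with the newest transition last:

```python
import random

def cer_sample(buffer, batch_size, rng=random):
    """Combined Experience Replay as described by Zhang and Sutton (2017):
    draw a uniform batch and append the most recent transition so that every
    update sees the latest experience."""
    batch = rng.sample(buffer, batch_size - 1)  # uniform sample from the buffer
    batch.append(buffer[-1])                    # always include the last transition
    return batch
```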

According to Fedus et al. (2020), it is necessary to investigate the behavior of ER with respect to the interrelated effects of variations in its hyperparameters, since these effects may not only be individual but also come from their joint variation. The authors studied the relationship between the size of the replay buffer, the replay capacity, and the time that a transition (which represents the policy at some moment of the learning process) remains in memory, which they call the age of a policy. Thus, the replay capacity is associated with state-action coverage, while the age of a policy (represented by its respective transition in memory) is related to its distance from the current learned policy (represented by the most recent transitions). It is possible to explore this relationship through a quantity called the replay ratio, which refers to the number of value-function update steps per agent interaction step with the environment. For example, DQN (Mnih et al. 2015) performs one update step (from the transitions memory) for every 4 interaction steps, which means a replay ratio of 0.25. The authors' primary objective was to understand how the agent's behavior changes as the replay ratio varies. Objectively, they defined the age of a policy as the number of value-function update steps performed since the storage of the corresponding transition in the buffer, and the replay capacity as the total number of transitions stored. So, with a replay capacity of 1 million transitions and a replay ratio of 0.25, the oldest policy age is 250,000 update steps. In this relationship, increasing the buffer size increases the replay capacity and the (possible) age of the oldest policy while keeping the replay ratio constant. However, fixing the potential age of the oldest policy while increasing the replay capacity requires storing more transitions from the current policy; in other words, it is necessary to decrease the number of value-function updates per environment interaction, which increases the number of interaction steps needed to store a larger number of transitions from the current policy and thus decreases the replay ratio. On the other hand, keeping the replay capacity fixed (fixing the buffer size) while decreasing the age of the oldest policy also requires more transitions from the current policy, which likewise reduces the replay ratio.
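The arithmetic relating these quantities is simple; the sketch below reproduces the example above (the function name is ours):

```python
def oldest_policy_age(replay_capacity, replay_ratio):
    """Oldest policy age in value-function update steps, following the
    relationship described by Fedus et al. (2020): the buffer holds
    `replay_capacity` environment transitions, and the agent performs
    `replay_ratio` updates per environment step."""
    return replay_capacity * replay_ratio


# DQN's defaults: a 1M-transition buffer and one update every 4 environment steps
assert oldest_policy_age(1_000_000, 0.25) == 250_000
```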

Still in Fedus et al. (2020), the authors performed experiments on 14 Atari 2600 games in ALE using the Dopamine version of Rainbow (Castro et al. 2018), demonstrating that Rainbow's performance consistently improves when the replay capacity increases, which is generally related to the reduction in the age of the oldest policy. When the policy age is fixed, performance increases with the growth of replay capacity, and this trend remains regardless of the value defined for the policy age, although the magnitude of the improvement depends on this value. For the authors, this may occur because larger state-action coverage can reduce the chances of overfitting to a smaller subset of transitions. On the other hand, when the replay capacity is fixed, performance tends to improve as the age of the oldest policy decreases. Based on the idea that the age of an old policy distances it from the current policy and that this age depends on the replay ratio and the replay capacity, the authors claimed that their experimental results suggest that learning from policies (sampling transitions) closer to the current policy can increase performance because the agent thus explores transitions with a greater potential return (Sun et al. (2020) also explored this hypothesis). An exception observed by the authors occurred when they fixed the replay capacity at 10 million and reduced the age of the oldest policy from 2.5 million to 250 thousand, which, according to them, can be explained by the decrease in the agent's score in two games, Montezuma's Revenge and Private Eye, which are environments with very sparse rewards that are challenging to explore. In these environments, because the sparse rewards demand learning long-term policies, the agent could not accumulate rewards when the authors reduced the potential age of the policies and concentrated the transition samples on those closer to the current policy. The authors also noted that increasing the buffer size while maintaining a fixed replay ratio leads to a performance improvement that may vary due to the interaction between the gain in replay capacity and the loss from accumulating older policies; thus, as the age of the policy increases, the benefit of increasing the replay capacity generally decreases. One can see that the issue with the Montezuma's Revenge and Private Eye games is related to the need to learn longer-term policies, which Neves et al. (2022) also sought to address by using recurrence on sets of similar transitions and sampling from a smaller memory whose size is not explicitly limited.

When conducting experiments with DQN, the authors observed that its performance did not increase with the growth of the replay capacity, either when fixing the replay ratio or when fixing the policy age, contrary to what they observed with the Dopamine version of Rainbow. This version is (basically) DQN with the addition of PER, C51, and n-step returns, plus the replacement of the RMSProp optimizer with Adam (Kingma and Ba 2015); compared to the original Rainbow, it does not include DDQN, dueling heads, or noisy networks. Therefore, the authors created four DQN variants, each receiving only one of those components, to investigate which of them interacted with the increase in replay capacity (from 1 million to 10 million transitions) to generate the performance gain observed in Rainbow. The results showed that the only independently added component that led to a considerable performance improvement with increased replay capacity in DQN was the n-step returns. This improvement also occurs when the policy age is fixed instead of the replay ratio. By removing the n-step returns from Rainbow, they verified that the agent did not benefit from the increase in replay capacity, whereas removing other components did not prevent the performance gains; this suggests that n-step returns are the only critical component for performance gain with increased replay capacity. They also observed that using n-step returns when there are few transitions in the buffer worsens DQN's performance, suggesting that the performance gain from its use only occurs with larger transition buffers. They also noticed that adding only PER to DQN does not significantly increase performance when the replay capacity is considerably large. One can see in Sect. 4 how PER distributes weights as a function of the magnitude of the temporal-difference errors in the transition memory and how its possible loss of ability to contribute to performance gains may be related to the buffer size. The authors also carried out experiments with two variations of DQN using n-step returns, trained offline from a buffer of 200 million transitions (corresponding to the total number of frames used to evaluate agents in the literature), to verify whether the performance gain persists with increasingly large buffers, keeping the replay ratio fixed and letting the buffer contain older policies. Those transitions come from another agent and do not have their Q-value estimates updated during the training of the current DQN agent. Still, the authors observed a consistent increase in performance.

Fedus et al. (2020) concluded that increasing the replay capacity and reducing the age of the oldest policy increase agent performance and that n-step returns are the only element used in Rainbow capable of taking advantage of an expanded replay capacity. Investigating the relationship between n-step returns and Experience Replay, the authors observed that the replay capacity can mitigate the variance of n-step returns, which partially explains the performance increase. They highlighted essential aspects of the interaction between the learning algorithms and the mechanisms that generate the training data (i.e., the transitions memory): the distance between the oldest policy and the current policy the agent is learning (a classic problem in RL with ER), the state-action coverage, the correlation between transitions (also explored in Neves et al. (2022) and Sun et al. (2020), although through a different form of exploiting their similarities), and the cardinality of the distribution support. For the authors, these aspects can be challenging to control separately and independently because typical algorithmic adjustments can affect several of them simultaneously. As future work, the authors pointed to studying how to untangle these different aspects to obtain agents capable of efficiently scaling in performance as the available data increases, thus investigating how these aspects of Experience Replay interact with other classes of off-policy and multi-step reinforcement learning methods.

5.2 Exploration efficiency

Fortunato et al. (2018) approached the relevant problem of exploration, proposing a method called NoisyNet to learn perturbations of the neural network weights and using them to drive exploration. The authors based this on the idea that a single change in the weight vector can induce consistent and very complex state-dependent changes in the policy over multiple time steps, instead of adding decorrelated and state-independent noise, as in \(\epsilon\)-greedy. According to the authors, most exploration heuristics, such as \(\epsilon\)-greedy and entropy regularization, may not produce the large-scale behavioral patterns necessary for efficient exploration in many environments. Methods based on optimism in the face of uncertainty are often limited to small state-action spaces or to linear function approximation. Intrinsic motivation methods augment the environment's reward signal with an additional term to reward novel discoveries, and many research works have proposed different forms for such terms. These methods separate the generalization process from exploration, using elements like intrinsic reward metrics and importance values, weighted relative to the environment reward, which are directly defined by the researcher rather than learned from the agent's interaction with the environment. Some evolutionary or black-box methods explore the policy space but require many prolonged interactions and are usually not data-efficient, requiring a simulator to allow many policy evaluations. NoisyNet is a neural network that uses the gradient of the agent's loss function (by gradient descent) to learn, alongside the other parameters of the agent, a parameter that defines the variance of the perturbations of the agent's network weights sampled from a noise distribution; the algorithm thus injects noise into the parameters and tunes its intensity automatically, defining what the authors called the energy of the injected noise. The authors evaluated NoisyNet versions of DQN, Dueling Networks, and A3C (Mnih et al. 2016) on 57 Atari 2600 games in the Arcade Learning Environment (Machado et al. 2018), and their results demonstrated that their agents achieved superhuman performance.
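A sketch of the forward pass of a noisy linear layer with factorized Gaussian noise, in the spirit of the construction above; the parameter shapes and the standalone-function form are our assumptions for illustration:

```python
import numpy as np

def noisy_linear_forward(x, mu_w, sigma_w, mu_b, sigma_b,
                         rng=np.random.default_rng()):
    """mu_w, sigma_w : learnable mean and noise-scale weights, shape (out, in).
    mu_b, sigma_b    : learnable mean and noise-scale biases, shape (out,).
    The learnable sigma parameters scale the injected noise, so the "energy"
    of the perturbation is tuned by gradient descent along with mu."""
    f = lambda e: np.sign(e) * np.sqrt(np.abs(e))
    eps_in = f(rng.standard_normal(mu_w.shape[1]))   # one noise vector per input
    eps_out = f(rng.standard_normal(mu_w.shape[0]))  # one noise vector per output
    w = mu_w + sigma_w * np.outer(eps_out, eps_in)   # factorized weight noise
    b = mu_b + sigma_b * eps_out                     # bias noise
    return w @ x + b
```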

5.3 Sampling efficiency

While Fedus et al. (2020) demonstrated that the use of n-step returns is a critical element for Rainbow's performance, ablative studies in Fujimoto et al. (2020) demonstrated that, in DQN, this element is PER. According to Fujimoto et al. (2020), although widely used, this prioritization method lacks a solid theoretical foundation, which they formally developed in their work. The expected loss over the distribution of a sample of transitions determines the gradient used for neural network optimization; therefore, when PER biases this distribution, it effectively modifies the gradient and influences the optimization process. The authors demonstrated that the expected gradient of a loss function minimized over a nonuniform distribution is equal to the gradient of another, distinct but equivalent, loss function minimized over a uniform distribution, and that one can use this relationship to transform any loss function into a prioritized sampling scheme with a new loss function, and vice versa. This transformation allows a concrete understanding of the benefits of nonuniform sampling, such as in PER, and provides a tool for designing new prioritization schemes. In this sense, the authors pointed out three relevant aspects. The first is that the loss function and the prioritization strategy must be linked, and the design of prioritized sampling methods should not be considered independently of the loss function, since this allows verifying the correctness of these methods by transforming the loss into its equivalent under uniform sampling and checking whether it produces the same results. According to the authors, with a proper loss function, even PER may not be biased, even without its importance sampling correction (presented in Sect. 4). The second aspect is variance reduction, which allows a deeper understanding of the benefits of prioritization since it is related to the expected gradients; this variance can be reduced by a loss function over uniform sampling and by carefully choosing a prioritization scheme defined in conjunction with a corresponding loss function. A third interesting aspect is that the formulations demonstrated by the authors suggest that some of the benefits obtained by prioritized sampling come from the changes generated in the expected gradient and not from the prioritization itself.

The authors focused their ablative analysis on three loss functions and how they relate to uniform and nonuniform sampling, using a prioritization scheme and comparing their expected gradients. They demonstrated how to carry out the proposed transformations and reduce the gradient variance by applying gradient steps of the same size instead of interspersing larger and smaller steps, pointing out a simple way to minimize the variance of any loss function while keeping the expected gradient unchanged. The authors then took PER as a basis and derived an equivalent loss function for uniform sampling, allowing them to point out corrections and possible improvements to the method. Among other things, they demonstrated that when PER is used with the Mean Squared Error (MSE), including some subsets with the Huber loss, it effectively optimizes a loss over the TD-error raised to a power higher than two, indicating that it can favor outliers in its estimation of the expected target values of the temporal difference rather than learning the mean. According to the authors, this bias in PER may explain its low performance in continuous-action algorithms that depend on the MSE. In addition, they demonstrated that the importance sampling ratios used by PER can be absorbed into the loss function itself, simplifying the algorithm. PER uses importance sampling to weight the loss function and reduce the bias introduced by prioritization; however, it is no longer biased when using the MSE and setting the hyperparameter \(\beta =1\) (see Sect. 4). As the expected gradient can absorb the prioritization, PER can be unbiased even without importance sampling, provided the expected gradient remains meaningful. Based on these findings, they proposed a new prioritization scheme called Loss-Adjusted Prioritized (LAP) Experience Replay, which simplifies PER by removing the unnecessary importance sampling ratios and setting the minimum priority to one, reducing bias and the likelihood of dead transitions with near-zero sampling probability. They also proposed an equivalent loss function for uniform sampling called the Prioritized Approximation Loss (PAL), which resembles a weighted variant of the Huber loss and produces the same expected gradient as LAP. The authors showed that when the variance in the prioritization is minimal, PAL can be used instead of LAP in a simple and computationally efficient way to train neural networks that estimate Q-values, reinforcing that the loss function and the form of prioritization are closely linked. They pointed out that the loss defined by PAL is never raised to a power greater than two, meaning it no longer has PER's outlier bias.
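A hedged sketch of the LAP prioritization rule as described above, combined with the Huber loss it is meant to be paired with; the exponent value is an illustrative choice, and the exact formulation should be taken from Fujimoto et al. (2020):

```python
import numpy as np

def lap_priorities(td_errors, alpha=0.4):
    """Priorities follow |TD-error|^alpha clipped below at one, and no
    importance-sampling correction is applied when the sampled batch is
    trained with a Huber loss."""
    return np.maximum(np.abs(td_errors) ** alpha, 1.0)

def huber_loss(td_errors, kappa=1.0):
    """Huber loss on the TD errors: quadratic for |delta| <= kappa and linear
    outside, which avoids the outlier bias of a squared loss under
    prioritized sampling."""
    d = np.abs(td_errors)
    return np.where(d <= kappa, 0.5 * d ** 2, kappa * (d - 0.5 * kappa))
```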

The authors used the Atari 2600 games and the MuJoCo continuous control tasks through the OpenAI Gym environment to empirically verify the effects produced by the LAP and PAL methods. In the first case, they combined their methods with DDQN and compared them with the original DDQN and with DDQN combined with PER. In the second case, they combined their methods with the TD3 algorithm and compared them with the original TD3, with TD3 combined with PER, and with SAC. According to the authors, there was no consistent difference between using LAP or PAL with TD3, which means that prioritization has little benefit in this domain and that the expressive performance gain comes from the change in the expected gradient. Consequently, they showed that it is possible to replace nonuniform sampling by modifying only the loss function. PER adds little benefit to TD3, which is consistent with the authors' ablative analysis showing that using the MSE with PER introduces bias. LAP also produced excellent results in the Atari games, surpassing the performance added by PER in 9 of 10 games, while PAL led to worse performance in 6 games. According to the authors, this suggests that prioritization plays a more significant role in this domain, considering that the games depend on longer observation horizons (with longer-term policy learning) and sometimes have sparse rewards, although some improvements may still come from changes in the expected gradient. The authors considered the performance of PAL in the MuJoCo tasks particularly interesting because of the method's simplicity, and they believe its benefits over the MSE and the original Huber loss come from its robustness and its ability to better approximate the mean.

For Fujimoto et al. (2020), more research is necessary to better understand nonuniform sampling and to propose new prioritization schemes. They also reinforced the sensitivity of deep reinforcement learning algorithms to minor changes, since they achieved considerable performance gains in well-known algorithms just by changing the loss function. For them, this suggests that works in the literature that rely on intense hyperparameter optimization or algorithmic changes may be showing gains over the original algorithms due to side effects that are unclear and beyond the scope of the papers' proposals.

Different ways exist to use the agent's experiences to update its value functions, whether on-policy in actor-critic methods, off-policy in methods based on Q-learning, or in evaluating target values in TD-learning. According to Sinha et al. (2022), importance-weighting methods for prioritization can improve the evaluation of the target value for longer traces using TD(\(\gamma\)) and can be used to reduce the bias of values computed from off-policy experiences. In this sense, the authors proposed a method to weight experiences based on their likelihood under the stationary distribution of the current policy, justifying this with a contraction argument over the Bellman evaluation operator. Their proposal aims to encourage on-policy sampling behaviors, similar to ReF-ER but without the need to know the policy distribution. For the authors, the Distribution Correction (DisCor) method (Kumar et al. 2020) suggests not using on-policy experiences in this context, which contrasts with their proposal; however, DisCor bases its analysis on the Bellman optimality operator instead of the Bellman evaluation operator. The difference is that the first operator seeks the optimal Q-value, while the second seeks the Q-value function of the current policy. The authors' objective was to improve the performance of TD-learning with function approximation, not to use the weights to estimate an advantage function or to reduce the bias in reward estimation. Indeed, Sinha et al. (2022) mixed on-policy and off-policy experiences and sought to balance their variance and bias by estimating likelihood-free density ratios and using the learned ratios as prioritization weights.

5.4 Data efficiency

On-policy methods are sometimes more effective in specific domains, such as continuous learning. However, using off-policy data produces more efficient sampling, which is critical for exploring environments with rare and expensive experiences. In this sense, replaying experiences based on prioritization schemes increases sampling efficiency, but its use in different methods and domains depends on the strategy and objective of the prioritization. For example, PER is not very effective in actor-critic methods because it bases its prioritization scheme on the magnitude of the TD-error of off-policy experiences stored in the buffer, whereas actor-critic methods seek to approximate the Q-value function induced by the current policy, for which it may be better to perform prioritized sampling that reflects on-policy experiences. Therefore, suitable prioritization schemes can lead to considerable improvements in sampling for actor-critic methods. According to Sinha et al. (2022), it is possible to estimate the value function of a policy by minimizing the expected squared difference between the estimate of the critic function and the estimate of its target function (actor-critic methods contain the actor function and the critic function, each with its respective target function) over a replay buffer that properly reflects the discrepancy between the two. One can consider this discrepancy as a priority when it preserves the contraction properties of the Bellman evaluation operator while being measured by the expected quadratic distance under some state-action distribution. In this sense, the authors presented a proof that the stationary distribution of the current policy is the only one under which the Bellman evaluation operator is such a contraction, and proposed using this stationary distribution as the underlying distribution of the replay buffer, leading to their method of experience replay with likelihood-free importance weights.

Generally, there are fewer experiences from the current policy, and their use therefore produces estimates with high variance; on the other hand, having more experiences collected under other policies in the same environment introduces bias. The method proposed by the authors therefore obtains its density-ratio estimates from a classifier trained to distinguish the two types of experience. For this, they used a smaller buffer, which contains experiences closer to on-policy ones, and a larger buffer to store the off-policy experiences, estimating the density ratios from these two buffers. These ratios are then used as importance weights to update the Q-value function, encouraging more updates from the more desirable state-action pairs under the stationary distribution of the current policy, which are more present in the smaller buffer. The authors combined their approach with Soft Actor-Critic (SAC) (Haarnoja et al. 2018), Data-regularized Q (DrQ) (Yarats et al. 2021), and DDQN, and compared it with uniform sampling, PER, and Emphasizing Recent Experience (ERE) (Wang and Ross 2019). To evaluate their approach, they used the ALE environments, the DeepMind Control Suite (DCS) (Tassa et al. 2018), and the OpenAI Gym tasks, demonstrating considerable improvements from their method.
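A hedged sketch of classifier-based, likelihood-free density-ratio weighting in the spirit of Sinha et al. (2022): a classifier trained to distinguish samples from the small near-on-policy buffer (label 1) from the large off-policy buffer (label 0) yields p(x), and the ratio p/(1-p) is used as an importance weight; classifier_prob is an assumed callable, and the normalization choice is ours:

```python
import numpy as np

def density_ratio_weights(batch_features, classifier_prob):
    """batch_features : iterable of transition feature vectors.
    classifier_prob   : callable returning the probability that a transition
                        came from the small (near-on-policy) buffer.
    Returns normalized importance weights for the sampled batch."""
    p = np.clip(np.array([classifier_prob(x) for x in batch_features]),
                1e-6, 1 - 1e-6)
    w = p / (1.0 - p)            # likelihood-free estimate of the density ratio
    return w / w.mean()          # normalized weights for the Q-value update
```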

Schmitt et al. (2020) proposed an importance sampling scheme for training actor-critic off-policy agents from a large replay buffer containing at least ten times more experiences than the limit of one million commonly used in the literature. As part of their work, they proposed solutions to improve stability and make the off-policy learning of those agents more efficient, even when they learn from the experiences of other agents through a shared experience replay process. Their approach obtained state-of-the-art results when training a single agent for 200 million interaction steps on the ALE and DMLab-30 environments and when training several agents concurrently sharing the same replay buffer. Their algorithm has two main characteristics: mixing transitions from the replay buffer with on-policy transitions and computing what the authors define as a trust region scheme.

Still according to Schmitt et al. (2020), combining Experience Replay with actor-critic algorithms is difficult due to their on-policy nature. Despite that, they proposed trust region schemes for mixing replay buffer experiences with on-policy transition data, which allowed the importance sampling method called V-trace to scale to data distributions over which its original formulation would become unstable. V-trace is a technique widely used in training actor-critic agents, which controls the variance commonly observed in naive importance sampling, but at the cost of introducing bias into the estimates. This bias arises because its estimate of the value function \(v_{\pi }\) does not correspond to the expected return of the policy \(\pi\) but rather to that of a policy \(\tilde{\pi }\), which is only implicitly related to \(\pi\) and is computed in a biased way; therefore, it can drift too far from \(\pi\). In this way, the policy gradient is also biased, so that, given a value function \(v^*\), V-trace does not guarantee convergence to a policy \(\pi ^*\) in offline training, as the authors demonstrated. Given this, blending in on-policy transitions can mitigate the distortion caused by the V-trace bias, regularizing the Q-value estimates. A trust region scheme for the off-policy V-trace limits the sampling of off-policy transitions by rejecting highly biased ones, aiming to provide the agent with an experience replay scheme enriched with experiences of low variance (because of V-trace) and low bias. For this, the authors defined a behavior-relevance function to classify relevant behaviors and a trust region estimator, which computes expected values from the relevant experiences. This approach applies the Kullback-Leibler divergence between the target policy \(\pi\) and the implicit policy \(\tilde{\pi }\) and is used for both the policy and the value estimates.

In addition to the performance improvement compared to state-of-the-art algorithms, the authors showed that uniform sampling obtains results comparable to those obtained with PER. They also demonstrated that learning using only off-policy experiences, without inserting recent experiences, degrades performance, as does using shared experiences without defining trust regions. According to the authors, their ablative experiments exhibited little benefit from using PER in actor-critic methods on the DMLab-30 environments. They highlighted that PER computes priorities based on the magnitude of the TD-error, which is poorly defined when sharing multi-agent experiences.

According to Kapturowski et al. (2019), increasingly complex partially observable domains have demanded considerable advances in the representation of transition memories and solutions based on recurrent neural networks such as LSTMs, whose use has increased to overcome the challenges of these environments. Given this, the authors investigated agent training using a recurrent neural network with Experience Replay. They demonstrated the effects of parameter lag, which results in representational drift and recurrent state staleness, potentially exacerbated in distributed training settings, leading to a loss of stability and performance during agent training. From a series of empirical studies on mitigating these issues, the authors presented their proposal called Recurrent Replay Distributed DQN (R2D2), whose algorithmic advances led to state-of-the-art results on Atari 2600 games in ALE and results equivalent or superior to the state of the art on the DMLab-30 environment. According to the authors, R2D2 was the first to achieve these results using the same network architecture and the same hyperparameter values across both benchmarks.

The authors modeled the environment as a Partially Observable Markov Decision Process (POMDP), defined by a tuple \((S,A,T,R,\Omega ,O)\), in which T corresponds to the transition function, \(\Omega\) is the set of observations potentially received by the agent, and O maps states to probability distributions over observations. Thus, the agent receives an observation \(o \in \Omega\) containing only partial information about the underlying state \(s \in S\). An action in the environment results in a transition to a state \(s'\sim T(\cdot \mid s,a)\), which yields an observation \(o\sim O(\cdot \mid s')\) and a reward \(r\sim R(s,a)\). The authors then used an RNN trained with Backpropagation Through Time (BPTT) (Werbos 1990) to learn a representation that disambiguates the true state of the POMDP. In turn, an R2D2 agent is a DQN agent with n-step returns that uses prioritized distributed experience replay and whose experiences are generated by 256 actors in parallel and consumed by a single learner. The actors use the Dueling Network architecture (Wang et al. 2016), with an additional LSTM layer after the convolutional layers, to approximate the Q-function. Instead of storing transitions represented by tuples \((s,a,r,s')\), the algorithm stores fixed-length sequences of \((s, a, r)\) tuples (\(m=80\)), with adjacent sequences overlapping periodically at a predefined time interval and never crossing episode boundaries. The authors used an invertible value-function rescaling of the reward values to generate the n-step target values for the Q-value function. They also used a more aggressive prioritization scheme that employs a combination of the maximum and the mean of the absolute n-step TD-errors, since the mean over long sequences tends to hide large errors, compressing the range of priorities and limiting the ability to prioritize valuable experiences.
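A sketch of slicing one episode into fixed-length, periodically overlapping (s, a, r) sequences for a recurrent replay buffer, as described above; the overlap of 40 steps and the decision to drop a shorter final chunk are our illustrative assumptions:

```python
def episode_to_sequences(episode, m=80, overlap=40):
    """episode : list of (s, a, r) tuples from a single episode.
    Returns sequences of exactly m steps; a shorter trailing chunk is dropped
    here for simplicity, so no sequence crosses an episode boundary."""
    step = m - overlap
    return [episode[i:i + m] for i in range(0, len(episode) - m + 1, step)]
```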

According to the authors, to deal with partially observable environments, agents need state representations that encode information about their trajectory of state-action pairs as well as their observation of the current state. The most common way to do this is to use an LSTM trained on complete trajectories stored in the replay buffer so that it can learn relevant long-term dependencies. For this, it is possible to use a zero start state to initialize the network at the beginning of the sampled sequence, which allows independent, decorrelated sampling of relatively short sequences, something essential for robust optimization. However, this forces the recurrent network to learn to recover useful predictions from an atypical initial recurrent state, which may limit its ability to rely on its recurrent state to exploit long temporal correlations. Another possibility is to replay entire episode trajectories, which avoids the problem of an inadequate initial state; however, the variation in sequence lengths (which also depends on the environment), the high variance of the network updates, and the use of highly correlated data can bring a series of stability problems. Therefore, the authors proposed and evaluated two training strategies to measure and mitigate the harmful effects of representational drift and recurrent state staleness. After comparing agents trained with each strategy in various DeepMind Lab environments, the authors identified that their combination consistently produced the smallest discrepancy in the last states of the sequence, together with more robust performance improvements than when either strategy was used separately.

Finally, the authors evaluated R2D2 on the 57 games of ALE and the different tasks of the DMLab-30 environment and compared their results with those obtained by Ape-X (Horgan et al. 2018) and IMPALA (Espeholt et al. 2018), whose hyperparameters, unlike R2D2's, were adjusted separately for each environment. The authors pointed out that one of the most significant contributions of DQN was its ability to generalize over different environments using the same network architecture and hyperparameter values; according to them, until the date of their publication, no other work had maintained this kind of generality, using the same architecture and hyperparameters in both the ALE and DMLab-30 environments. The authors stated that Rainbow and IQN (Dabney et al. 2018) held the single-agent state of the art on Atari games, while Ape-X achieved state-of-the-art results using multiple actors. R2D2 obtained better results in these environments than the other single-agent methods and quadrupled the results obtained by Ape-X. They also pointed out that R2D2 achieved above-human performance in 52 of the 57 games, something other methods had not yet obtained in many of those games. Still, like the other methods, R2D2 did not show considerable advances in the Montezuma's Revenge and Pitfall games, which are known to be difficult environments to explore. The DMLab-30 suite consists of 30 problems in first-person 3D environments and requires long-term memory to obtain reasonable results. According to the authors, while the best-performing algorithms had been actor-critic methods trained with some on-policy regime, R2D2 was the first to reach the state of the art using a value-function-based agent.

5.5 Catastrophic forgetting

Rolnick et al. (2019) addressed the problem of catastrophic forgetting that occurs when new experiences overwrite old ones in the multitask continual learning scenario. They proposed a method called Continual Learning with Experience Replay (CLEAR), which seeks to balance off-policy learning, using behavioral cloning from experience replay, with on-policy learning, in a trade-off they define between the concepts of stability (the preservation of acquired knowledge) and plasticity (the acquisition of new knowledge). According to the authors, the literature has often mitigated catastrophic forgetting by using intensive computational resources to try to learn all tasks simultaneously rather than sequentially. However, this problem becomes critical as the application of reinforcement learning to continual learning problems grows in industry and robotics, with scenarios where rare (and difficult to obtain) experiences may be more common, making simultaneous learning unfeasible. This demands that the agent be able to learn one task at a time, in a sequence that is not under its control and whose boundaries are not known. For the authors, this training paradigm eliminates the possibility of simultaneous learning across several tasks and thus increases the chance of catastrophic forgetting. Efforts to prevent catastrophic forgetting have concentrated on approaches that seek to protect the neural network parameters inferred for a given task when the agent learns another task, motivated by the concept of synaptic consolidation from neuroscience. For Rolnick et al. (2019), many possibilities of using Experience Replay in the catastrophic forgetting scenario have been ignored in the literature, since the works that widely investigated experience replay did so focusing on the data efficiency of agent learning.

To ensure the stability expected from off-policy learning in CLEAR, the authors introduced a method of behavioral cloning between past and current policies. Based on ablative studies on three DeepMind Lab tasks, they evaluated the effects of their approach in reducing the damage caused by catastrophic forgetting and verified its behavior under different balances of the stability-plasticity trade-off, concluding that behavioral cloning is just as crucial to CLEAR's performance as using on-policy experiences. They used 900 million frames observed by the agent, varying the size of the replay buffer from 450 million down to 5 million experiences; only in the latter case was some loss of performance observed. The authors also conducted experiments to evaluate CLEAR's performance under varying balances between off-policy and on-policy learning and how these variations impact stability and plasticity, concluding that a 50/50 split led to better results on the DMLab-30 environment, while 75/25 was the best balance on the Atari 2600. Finally, the authors compared CLEAR to two state-of-the-art methods for reducing catastrophic forgetting that assume task boundaries are known, demonstrating that CLEAR achieved equivalent or better results than both methods despite being simple and agnostic about task boundaries.

According to Rolnick et al. (2019), in continual learning cases in which storing experiences in a replay buffer is prohibitive, better approaches are the methods focused on protecting parameters when passing from one task to another. In scenarios where task types and boundaries can somehow be shared, exploiting this can reduce the computational cost or even accelerate agent learning. However, in many cases, such as when the action space changes from one task to another, trying to address past policy distributions, whether through behavioral cloning, off-policy learning, weight protection, or another strategy to prevent catastrophic forgetting, could lead to considerable performance losses. Developing algorithms that selectively forget or protect specific learned behaviors would be necessary in those cases.

5.6 Sparse rewards

Andrychowicz et al. (2017) presented a technique to deal with sparse rewards, one of the most challenging problems in Reinforcement Learning, because it forces the agent to learn long-term policies (until it receives a reward) or to explore an arbitrarily large space of experiences, given that immediate rewards are fundamental to guiding the approximation of the optimal policy. Despite impacting all RL methods in different domains, this is a particular issue when dealing with continuous action spaces in robotics-related problems. According to the authors, a common challenge, especially in robotics, is engineering a reward function that reflects the task and can guide policy optimization. Many approaches dedicate great effort to formulating complicated cost functions for problems like stacking a brick on top of another, which limits the application of RL because it requires domain-specific knowledge. Motivated by the way human beings learn almost as much from undesired outcomes as from good or desired ones, they proposed bringing this same idea to reinforcement learning whenever there are multiple goals to achieve, i.e., achieving each state can be treated as a separate goal. Their approach was therefore to train universal policies, which take as input the current state and a goal state, and to replay each episode with a goal different from the one the agent was originally trying to achieve, namely one of the goals actually achieved in the episode.

After experiencing some episodes, their algorithm, called Hindsight Experience Replay (HER), stores every transition resulting from the agent's experiences in the replay buffer, along with the original goal used for that episode and a subset of other goals. As the current goal influences the agent's actions but not the environment dynamics, replaying each trajectory with an arbitrary goal is possible using an off-policy method. According to the authors, one relevant aspect is the strategy used to choose the additional goals for the replay: in the simplest version, they replay each trajectory with the goal achieved in the final state of the episode, but they experimentally evaluated different types and quantities of additional goals; in all cases, they also replay each trajectory with the original goal. The authors argue that HER can be seen as a form of implicit curriculum, because the goals used for replay naturally shift from ones that are simple to achieve even by a random agent to more difficult ones; however, HER does not require any control over the distribution of initial environment states. The experimental results demonstrate that HER learns with extremely sparse rewards and performs better with sparse rewards than with shaped ones.
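A minimal sketch of hindsight relabeling with the "final" strategy described above, assuming the environment exposes a goal-conditioned reward function and each stored transition records the goal achieved at that step (the tuple layout and field order are our assumptions):

```python
def her_relabel(episode, reward_fn):
    """episode   : list of (s, a, s_next, achieved_goal, original_goal) tuples.
    reward_fn    : callable (achieved_goal, goal) -> reward, assumed to be the
                   sparse goal-conditioned reward of the environment.
    Returns the transitions stored in the buffer: each one once with the
    original goal and once with the goal achieved in the final state."""
    stored = []
    final_goal = episode[-1][3]   # goal achieved at the end of the episode
    for (s, a, s_next, achieved, goal) in episode:
        stored.append((s, a, reward_fn(achieved, goal), s_next, goal))
        stored.append((s, a, reward_fn(achieved, final_goal), s_next, final_goal))
    return stored
```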

6 Research and applications on ER and some directions for future works

A systematic literature review can show researchers' interest in Reinforcement Learning using Experience Replay over recent years and point out future directions. We applied six restriction levels when querying five indexing databases: CAPES, ACM-DL (Full-text collection and Guide to Computing Literature), IEEE Xplore, ScienceDirect, and Scopus. From that, we selected some relevant works that are recent and closely related to the subjects discussed in this survey. Appendix A presents the detailed search results, methodology, and applied criteria. Although we are more interested in works that propose changes or new methods using ER, mainly those that investigate it by delving into theoretical issues and empirical analyses, we verified that many works in the literature apply well-known RL methods to different classes of complex and interesting problems. We present these works in Appendix A and highlight some of them which, besides focusing on applied RL, brought interesting adaptations of RL and ER methods to make them more suitable to their problems and application domains.

Optimal control is a recurrent problem in the literature, arising in complex non-linear systems such as robotics and autonomous driving and in domains such as biochemical reactions (Yang et al. 2022). Many research works use variations of policy-gradient-based reinforcement learning methods with novel experience replay mechanisms. Wang et al. (2019) transformed adaptive cruise control problems into optimal tracking control problems handled by a novel model-free Adaptive Dynamic Programming (ADP) approach called ADPER. Similarly, Yang and He (2020) presented a decentralized Event-triggered Control (ETC) strategy based on Adaptive Critic Learning (ACL) using experience replay, and Zhou et al. (2022) proposed an attention-based actor-critic algorithm with Prioritized Experience Replay (PER) to improve convergence time on robotic motion planning problems, modifying the LSTM-based advantage actor-critic algorithm with encoder attention weights and initializing the networks using PER. Kim et al. (2020) introduced a motion planning algorithm for robot manipulators using a twin delayed deep deterministic policy gradient, which applies the Hindsight Experience Replay (HER) formulated by Andrychowicz et al. (2017). Prianto et al. (2020) approached path planning for multi-arm manipulators with a method based on the Soft Actor-Critic (SAC) algorithm with hindsight experience replay to improve exploration in high-dimensional problems. Sovrano et al. (2022) approached the complex problem of autonomous driving in rule-dense environments by partitioning the experience buffer into clusters labeled by explanations about rulesets, defining a method called Explanation-Awareness Experience Replay (XAER). Cui et al. (2023) presented their experience replay approach, Double Bias Experience Replay (DBER), and a new loss function (addressing a challenging environment modeling problem in these domains), applied to the classical off-policy algorithms DQN and DDQN and also to the Quantile Regression DQN (QR-DQN). Li and Ji (2021) proposed a distributed training framework with parallel curriculum experience replay to approach sparse rewards in the distributed training of robots in a simulated environment. Hu et al. (2023) describe and evaluate the Asynchronous Curriculum Experience Replay (ACER), which uses multiple threads to update priorities and increases the diversity of experiences. Regarding the memory of experiences, they introduce a temporary pool to improve learning from fresher experiences and change the memory buffer policy from FIFO (First-In, First-Out) to FIOU (First-In, Useless-Out) to enhance learning from old experiences. The authors' main objective was to overcome what they identified as limitations of PER to achieve safe autonomous motion control of unmanned aerial vehicles in complex, unknown environments. Liu et al. (2024) approached the challenges of insufficient data, dynamic uncertainties, long time delays, and slowly time-varying thermal processes in the coordinated control of coal-fired power generation systems. They proposed a DDPG-based method called DPER-VDP3G with a dual-prioritized experience replay and a value distribution strategy to reduce nonuniform sampling bias, remove redundant data, enhance sample diversity, and improve the accuracy of the cost function.
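The FIOU policy described by Hu et al. (2023) can be illustrated, under our own simplifying assumptions, as a buffer that evicts the transition currently judged least useful instead of the oldest one; the `FIOUBuffer` class and its `usefulness` score below are hypothetical and only sketch the eviction idea, not the authors' implementation.

```python
import heapq
import itertools


class FIOUBuffer:
    """Illustrative 'First-In, Useless-Out' buffer: when full, evict the
    transition with the lowest usefulness score rather than the oldest one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []                    # min-heap of (usefulness, counter, transition)
        self.counter = itertools.count()  # tie-breaker so transitions are never compared

    def add(self, transition, usefulness):
        if len(self.heap) >= self.capacity:
            heapq.heappop(self.heap)      # drop the currently least useful experience
        heapq.heappush(self.heap, (usefulness, next(self.counter), transition))

    def __len__(self):
        return len(self.heap)


buf = FIOUBuffer(capacity=2)
buf.add({"s": 0, "a": 1}, usefulness=0.9)
buf.add({"s": 1, "a": 0}, usefulness=0.1)
buf.add({"s": 2, "a": 1}, usefulness=0.5)  # evicts the 0.1-usefulness transition, not the oldest
```

A FIFO buffer would instead discard the oldest entry regardless of how useful it still is; the eviction criterion is the only difference in this sketch.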

Many works explored novel data structures to make Experience Replay even more data-efficient by modeling the behavior of the transitions and exploring predictions about states and rewards (Jiang et al. 2021). Others proposed different ways to define the importance of samples (Kong et al. 2021), new prioritization criteria (Gao et al. 2021), and changes in the updates of the value function approximation to speed up convergence by avoiding locally optimal policies (Kang et al. 2021). In Wei et al. (2022), one can find a new replay paradigm (inspired by quantum theory) that considers the complexity of each experience and the number of times it has been replayed to achieve a better balance between exploration and exploitation, which is a fundamental choice and a complex problem. According to Wang et al. (2024), an imbalanced class distribution may affect the performance of deep reinforcement learning with ER applied to classification problems. Therefore, they approached the problem of customers' credit scoring in P2P lending by modeling it as a discrete-time finite Markov decision process and proposed a balanced stratified prioritized ER strategy to optimize the loss function of a DQN model. Their objective was to balance the numbers of minority and majority experience samples in the mini-batch (according to the class representation) and select more important experience samples for replay based on the principles of PER. The authors defined concepts and measures of majority and minority experience samples and stored the samples in separate minority and majority experience replay buffers. They calculated the TD-error for samples from each buffer, derived prioritization probabilities from the respective TD-errors, and used two value functions and two target functions, one for each type of sample.
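As a rough illustration of this kind of class-balanced, prioritized sampling (not the authors' implementation), the sketch below draws half of each mini-batch from a minority buffer and half from a majority buffer, with within-buffer probabilities proportional to the absolute TD-error raised to a PER-style exponent; the buffer contents, TD-error arrays, and the `alpha` value are placeholders.

```python
import numpy as np


def sample_balanced_prioritized(minority_buf, majority_buf,
                                td_minority, td_majority,
                                batch_size, alpha=0.6, seed=None):
    """Draw half of the mini-batch from each class-specific buffer, with
    within-buffer probabilities proportional to |TD-error| ** alpha.
    Each td_* array is assumed to have one entry per stored transition."""
    rng = np.random.default_rng(seed)
    half = batch_size // 2
    batch = []
    for buf, td in ((minority_buf, td_minority), (majority_buf, td_majority)):
        probs = (np.abs(np.asarray(td, dtype=float)) + 1e-6) ** alpha
        probs /= probs.sum()
        idx = rng.choice(len(buf), size=half, p=probs, replace=True)
        batch.extend(buf[i] for i in idx)
    return batch
```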

Catastrophic forgetting is another well-known problem affecting training stability and agent performance, especially in continuous action spaces in multi-agent scenarios. It is mainly related to the buffer size limitation and is sensitive to the sampling and storage strategies. Recent works addressed this problem by proposing strategies to improve control over the mechanisms for storing, selecting, retaining, and forgetting in experience replay (Osei and Lopez 2023), including using transfer learning over past experiences (Anzaldo and Andrade 2022). According to Li et al. (2021), their Self-generated Long-term Experience Replay (SLER) approach improved the dual experience replay algorithm applied in continual learning tasks, mitigating catastrophic forgetting while reducing the growth in memory consumption. Li et al. (2022a) proposed a method to cluster and replay experiences using a divide-and-conquer framework based on time division to explore experiences that may not be prioritized during sampling or may even be forgotten due to limited transition memory.

As discussed in Sect. 4, a (relatively) recent, theoretically relevant, and still little-explored question concerns the trade-off between how recent and close to the current policy (and thus potentially more biased) and how outdated (but more diverse, possibly rare, or expensive to obtain) the experiences in memory should be in order to contribute to the agent's learning process. Some works approached this question by changing the priority measure in PER. Ma et al. (2022) proposed to increase the probability of sampling more recent experiences with a novel strategy to replace experiences in the memory buffer, while Zhang et al. (2020) presented a novel self-adaptive priority correction algorithm called Importance-PER to reduce bias. Instead of changing the sampling strategy, Du et al. (2022) proposed a framework to refresh experiences by moving the agent back to past states, executing sequences of actions following its current policy, and storing and reusing the new experiences from this process whenever they turn out better than what the agent previously experienced. Liu et al. (2022) proposed a dynamic experience replay strategy based on Multi-armed Bandits, which combines multiple priority-weighted criteria to measure the importance of experiences and adjusts their weights from one episode to another. Yang and Peng (2021) introduced the Metalearning-Based Experience Replay (MSER), applied to DDPG, to deal with the computational complexity of PER and its need for careful hyperparameter adjustment. They divided the experience memory into a successful-experience buffer and a failure-experience buffer and uniformly sampled from those buffers according to a ratio learned by a neural network, as sketched below.
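The following is a minimal sketch of this dual-buffer sampling step, assuming the success ratio is given as a plain number (in MSER it would be predicted by the meta-learning network); the function name and signature are ours.

```python
import random


def sample_by_ratio(success_buffer, failure_buffer, batch_size, success_ratio):
    """Uniformly sample a mini-batch split between the success and failure
    buffers according to success_ratio (a value in [0, 1])."""
    n_success = min(int(round(batch_size * success_ratio)), len(success_buffer))
    n_failure = min(batch_size - n_success, len(failure_buffer))
    return (random.sample(success_buffer, n_success) +
            random.sample(failure_buffer, n_failure))
```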

Hindsight Experience Replay (HER) and Aggressive Rewards to Counter Bias in HER (ARCHER) (Lanka and Wu 2018) are two strategies to deal with the classic and challenging problem of sparse rewards, but they also present some problems. HER treats every failure as a success for an alternative (virtual) goal and samples these goals uniformly. However, it introduces bias by ignoring that these goals have variable importance at different training moments and by not considering their relevance to agent learning. Vecchietti et al. (2022) showed that an essential factor in learning multi-goal tasks with HER is the (relative) rate of hindsight experience used in each training epoch; standard HER replaces real experience with hindsight experience at a fixed rate during the entire training process. However, their results suggest that hindsight experiences are more relevant at the beginning of training, for example, when a robot learns the basic sensing skills and subtasks necessary to achieve the goal. They proposed adjusting the rate of hindsight experience by using a variable sampling rate between real and hindsight experiences during training. Manela and Biess (2021) proposed to improve HER by prioritizing virtual goals and reducing bias by removing misleading samples. Manela and Biess (2022) presented an algorithm that combines curriculum learning with HER to learn sequential object manipulation tasks with multiple goals and sparse feedback by exploiting the recurrent structure inherent in many object manipulation tasks. Chen et al. (2022a) approached the problem of sparse rewards in online recommendation systems that use reinforcement learning. They defined a state-aware experience replay model that lets the agent selectively discover relevant experiences using locality-sensitive hashing, retaining the most meaningful experiences at scale and replaying more valuable experiences with a higher chance. Dong et al. (2023) proposed the Curiosity-tuned Experience Replay (CTER) method, whose curiosity mechanism generates an intrinsic reward based on a predicted curiosity value to deal with sparse rewards in command decision modeling for simulated wargaming scenarios. This mechanism also provides an adaptive exploration strategy, a novel prioritized replay, and a more efficient strategy to update the memory of experiences. To improve exploration, they introduced decaying and normalizing factors to guide the agent to explore feasible paths under sparse rewards and a curiosity-adjusted, partially greedy exploration to control the \(\epsilon\)-greedy policy adaptively according to the current curiosity level of the experience memory. Regarding the ER itself, they designed a curiosity-augmented sampling technique that prioritizes experiences by considering both the TD-error and the curiosity. For storing the experiences, they presented a K-segmented, curiosity-balanced memory updating approach, which aims to balance the age and usefulness of the experiences in the buffer.
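To illustrate how a curiosity signal might be combined with the TD-error for prioritization and used to modulate exploration, the fragment below shows one possible (hypothetical) formulation; the mixing weight `beta`, the epsilon bounds, and the exact functional forms are our assumptions, not those of Dong et al. (2023).

```python
import numpy as np


def curiosity_priority(td_errors, curiosities, beta=0.5, eps=1e-6):
    """Priority that mixes |TD-error| with a curiosity signal; beta balances
    the two terms (an illustrative simplification of the idea)."""
    td = np.abs(np.asarray(td_errors, dtype=float))
    cur = np.asarray(curiosities, dtype=float)
    return beta * (td + eps) + (1.0 - beta) * (cur + eps)


def curiosity_adjusted_epsilon(mean_curiosity, eps_min=0.05, eps_max=0.5):
    """Explore more while the experience memory still looks 'curious'
    (poorly predicted), less once average curiosity has dropped."""
    return eps_min + (eps_max - eps_min) * float(np.clip(mean_curiosity, 0.0, 1.0))
```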

Multi-agent and cooperative robot tasks can also take advantage of experience replay. Yu et al. (2023) approached the problem of processing information about the environment as the number of participants increases by proposing a hybrid attention module integrated with the Multi-agent Deep Deterministic Policy Gradient and prioritized experience replay (HAER-MADDPG). In turn, Nicholaus and Kang (2022) proposed an experience replay technique that brings additional strength to the exploration-exploitation trade-off in these scenarios.

Many works in the literature apply well-known RL methods to approach different classes of complex and exciting problems, such as: (i) geographical routing-decision processes to assign sensing tasks to mobile users; (ii) anomaly detection in smart environments; (iii) cellular-connected unmanned aerial vehicle networks; (iv) nonlinearities and uncertainties of biochemical reactions in wastewater treatment process control; (v) robotic lever control; (vi) handover decisions in 5G ultra-dense networks; and (vii) software test automation (Tao and Hafid 2020; Fährmann et al. 2022; Koroglu and Sen 2022; Crowder et al. 2021; Li et al. 2022b; Wu et al. 2022; Remman and Lekkas 2021; Rosenbauer et al. 2020).

In UAV ad hoc networks (UANETs), each unmanned aerial vehicle (UAV) node can communicate with the others through a routing protocol. However, UAV routing protocols face the challenges of high mobility and limited node energy, leading to unstable links and sparse network topology due to premature node death. Therefore, Zhang and Qiu (2022) proposed DSEGR, a Deep Reinforcement-Learning-based Geographical Routing Protocol for UANETs that considers link stability and energy prediction. They use the Autoregressive Integrated Moving Average (ARIMA) model to predict the residual energy of neighbor nodes and a link stability evaluation indicator. They modeled the packet forwarding process as an MDP and used a DDQN with PER to learn the routing decision process. They also designed a reward function to obtain a better convergence rate and used the Analytic Hierarchy Process (AHP) to analyze the weights of the factors considered in that reward function. Finally, the authors conducted simulation experiments with DSEGR to analyze network performance, and the results demonstrate that their proposal outperforms others in packet delivery ratio and has a faster convergence rate.

According to Shi et al. (2024), UAVs equipped with mobile edge computing servers have become an emerging technology that provides computing resources for mobile devices, effectively relieving the computational pressure of massive data in 6G wireless networks. Therefore, they investigated a Multi-UAV Collaborative Assisted Mobile Edge Computing architecture that jointly optimizes the UAV trajectories and the scheduling strategies for mobile device offloading in order to optimize computational costs and reduce the consumption of the limited onboard energy. They converted this non-convex optimization problem with high-dimensional continuous actions into an MDP and proposed the UAVs-assisted Offloading Strategy based on Collaborative Multi-Agent RL (UOS-RL). Due to the highly dynamic variation of the environment, they also presented an experience prioritization mechanism to improve training efficiency in this scenario. The simulation results demonstrate that the proposed PER-UOS-RL algorithm outperforms existing works in terms of computational cost.

According to Panda et al. (2024), microgrids are self-supporting generation sources that incorporate renewable energy sources. Managing the batteries' charge–discharge levels is essential for the devices' long-term efficiency and reliability. An RL-based strategy can provide instructions for generating pulse width modulation signals in grid-connected inverters. These inverters possess bidirectional power exchange capabilities, enabling them to regulate the direction and magnitude of power flow between the battery and the utility grid, and the agents are trained on real-world sensor readings in practical scenarios to govern the inverter's operations, thereby managing the battery's charging and discharging processes. According to the authors, previous approaches used Deep Q-learning-based methods with PER. Therefore, they proposed and justified investigating Distributional RL with PER in a residential PV-microgrid setup, exploring various algorithms. They focused on energy management to reduce the net power imported from the grid, paying particular attention to formulating the penalty function so that the battery does not operate at its extreme limits, since an overly complicated reward function can also slow convergence of the learning process. They (i) benchmarked different algorithms with PER, (ii) analyzed the training performance of the deep distributional and Q-learning algorithms with varied discretized action spaces, random experience replay, and a penalty without ToU-induced corrective action, and (iii) analyzed battery operation performance.

Batch processes produce relatively few, high-value-added products, such as fine chemicals, polymers, and pharmaceuticals. RL is a potential alternative to traditional control methods, such as model predictive control, whose control performance degrades severely when the process model is inaccurate. Therefore, Xu et al. (2024) proposed a batch process controller based on Segmented Prioritized Experience Replay (SPER) and the Soft Actor-Critic (SAC), which can obtain a control strategy more robust than other RL methods for accurately dealing with the complex nonlinear dynamics and unstable operating conditions of such processes. SPER is an experience sampling method that the authors designed to improve the efficiency of ER in tasks with long episodes and multiple phases. They also proposed a novel reward function to deal with sparse rewards. They showed the effectiveness of their SPER-SAC-based controller by comparing it with other RL-based control methods.
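Under our own reading of the segmentation idea (the paper's exact scheme may differ), a segmented prioritized sampler could partition the buffer by episode phase and draw an equal share of the mini-batch from each segment with PER-style probabilities, as in the hypothetical sketch below; segment boundaries, `alpha`, and the equal-share rule are assumptions.

```python
import numpy as np


def sample_segmented_per(segments, segment_td_errors, batch_size,
                         alpha=0.6, seed=None):
    """Partition-aware prioritized sampling: 'segments' is a list of per-phase
    transition lists, 'segment_td_errors' the matching lists of TD-errors.
    An equal share of the mini-batch is drawn from each segment with
    probabilities proportional to |TD-error| ** alpha."""
    rng = np.random.default_rng(seed)
    per_segment = max(1, batch_size // len(segments))
    batch = []
    for seg, td in zip(segments, segment_td_errors):
        probs = (np.abs(np.asarray(td, dtype=float)) + 1e-6) ** alpha
        probs /= probs.sum()
        idx = rng.choice(len(seg), size=per_segment, p=probs, replace=True)
        batch.extend(seg[i] for i in idx)
    return batch
```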

To address problems involving multi-vehicle pursuit, such as autonomous police vehicles pursuing suspects, Li et al. (2024) proposed a multi-agent reinforcement learning algorithm called Progression Cognition Reinforcement Learning with Prioritized Experience for Multi-Vehicle Pursuit (PEPCRL-MVP) for urban multi-intersection dynamic traffic scenes. They used a prioritization network to assess state transitions in the global experience replay buffer according to each agent's parameters, a mechanism that introduces diversity into the multi-agent learning process and improves collaboration and task-related performance. Furthermore, they employed an attention module to extract critical features from dynamic urban traffic environments and used it to develop a progression cognition method that adaptively groups pursuing vehicles so that each group efficiently targets one evading vehicle. The authors used a simulator with unstructured roads in an urban area and concluded that PEPCRL-MVP is superior to other state-of-the-art methods.

The recent literature approaches the problems discussed in Sect. 5 in continuous action spaces, multi-agent settings, robots and humanoids, and control of complex nonlinear systems, as in the application works we presented here. Most proposals are mainly concerned with strategies for sample and data efficiency, and the path researchers are pursuing is clear: speeding up and improving the training process in increasingly complex environments; for that, determining how to explore and exploit the agent's experiences is still a critical question. Moreover, many problems arise from the fundamental issue of the limited buffer size, for which simple solutions based on finding an arbitrary or heuristically measured size seem insufficient, and this aspect needs more attention in the literature. As a promising direction for future work, we suggest investigating the memory of experiences as a dynamically sized structure and looking at the experiences themselves beyond the old, well-known structure that represents a transition at a past time. It is essential to ask how we could work with a memory that is more elastic and flexible in its structure, or how we could explore the relations and dynamics between the agent's experiences and make this information as relevant as the number or the priority of the experiences the agent is replaying, in the sense discussed by Zhang and Sutton (2017) and Neves et al. (2022).

7 Structured summary of literature

This section summarizes the research works, methods, and challenges discussed in this extensive review, whose focus was contributions to Experience Replay. The main algorithmic strategies and the proposed architectures were the first facets used to subdivide and group the research works, as presented in Fig. 1. All methods address ER in some relevant aspect. The first distinction between research works within the taxonomy is at the individual level, which groups them according to the following criterion. Some works focus on investigating and proposing strategies for specific ER problems related to its formulation and its different methods; these are identified with a green mark in Fig. 1. Other works study and present innovative ways and improvements to existing ER methods applied to complex real-world problems and are identified with a purple mark in Fig. 1. These markings are at the lower right corner of each box that delimits an article and its authors. The second distinction criterion concerns the ER strategies each work approaches, identified by the larger label blocks arranged vertically in the chart, from top to bottom (e.g., naive or non-naive ER? What form of prioritization or relevance does it focus on?). Lastly, the third criterion is the algorithm modifications or architectural changes that each work proposes and uses (e.g., a new neural network architecture, new exploration strategies, or new data structures for the memory of experiences). This criterion can yield more than one distinction for the same work, since a work may bring contributions in different but combined aspects; this is shown by stacking label blocks horizontally from left to right.

Fig. 1 Diagram of research works organized according to algorithm strategies and proposed architectures. A green mark on the lower right corner indicates works focusing on strategies for specific ER problems, while a purple mark is used for works focusing on improving existing ER methods applied to complex real-world problems

The second facet focuses on the research domain, the main problem each work addresses (Table 1), and the proposed approach, organized in Tables 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19. Each table contains the research works corresponding to the larger label blocks in Fig. 1, in chronological order of publication. Tables 2, 3, 4, 5 present the works that employ the original homogeneous sampling from a FIFO-like replay buffer but propose relevant architectural and algorithmic improvements on methods strongly based on ER. Tables 6, 7, 8 present works whose propositions mainly focus on curriculum and hindsight ER strategies. Tables 9, 10, 11, 12, 13, 14, 15, 16 present the works whose propositions involve some strategy of experience prioritization or importance sampling, as well as those that combine them with architectural or algorithmic propositions (e.g., PER with changes in the neural network and a new exploration method). Tables 17, 18, 19 present research works that aim to improve the sample and data efficiency of ER by focusing directly on architectural and algorithmic improvements (e.g., by proposing new data structures for the memory of experiences, compacting or extending the memory, or applying recurrent neural networks).

Table 2 Naive ER with uniform sample plus buffers or experiences modeling
Table 3 Naive ER with uniform sample in DP-based Methods
Table 4 Naive ER with uniform sample plus propositions on DNN
Table 5 Naive ER with uniform sample plus propositions on DNN plus exploration
Table 6 Non-naive ER - curriculum experience replay
Table 7 Non-naive ER - curriculum experience replay plus buffers or experience modeling
Table 8 Non-naive ER - hindsight experience replay
Table 9 Non-naive ER - prioritized and importance sampling
Table 10 Non-naive ER - prioritized and importance sampling (cont.)
Table 11 Non-naive ER - prioritized and importance sampling plus proposition on DNNs
Table 12 Non-naive ER - prioritized and importance sampling plus strategies for exploration
Table 13 Non-naive ER - prioritized and importance sampling plus buffers or experiences modeling
Table 14 Non-naive ER - prioritized and importance sampling plus stratifying and balancing experiences
Table 15 Non-naive ER - prioritized and importance sampling in ensemble or multi-strategy methods
Table 16 Non-naive ER - prioritized and importance sampling - theoretical-empirical studies
Table 17 Non-naive ER - buffers or experience modeling
Table 18 Non-naive ER - buffers or experience modeling (cont.)
Table 19 Non-naive ER - buffers or experience modeling plus propositions on DNNs

8 Conclusions

This work demonstrates that Experience Replay is a fundamental idea with many open theoretical and empirical problems, which are still being investigated to understand its contributions and to propose improvements and new applications with different reinforcement learning methods for solving complex problems in many research fields. Automation, robotics, autonomous driving, trajectory planning, and optimization are among the many application areas that lead to the proposition of new reinforcement learning methods, as well as new approaches and techniques of experience replay, so that these methods can become even more efficient in using the data from transitions experienced by agents. New schemes for experience prioritization and importance sampling, techniques to avoid catastrophic forgetting, ways of dealing with sparse rewards, and improvements to memory efficiency in multi-agent environments are among the research efforts explicitly dedicated to improving experience replay.