An offline-to-online reinforcement learning approach based on multi-action evaluation with policy extension


Abstract

Offline Reinforcement Learning (offline RL) learns from pre-collected offline data without real-time interaction with the environment by regularizing the policy through distributional or support-set constraints. However, because these constraints are overly conservative, the policy learned from offline data under support-set constraints is usually close to the behavioral policy, so offline RL struggles with active behavioral exploration. Moreover, without online interaction, policy evaluation becomes prone to inaccuracy, and the learned policy may lack robustness in the presence of sub-optimal state-action pairs or noise in the dataset. In this paper, we propose an Offline-to-Online Reinforcement Learning Approach based on Multi-action Evaluation with Policy Extension (MAERL) to improve policy exploration and the value evaluation of state-action pairs in offline RL. MAERL consists of four modules: (1) in the policy extension module, we design a policy extension method that uses the online policy to extend the offline policy; (2) in the multi-action evaluation module, we present an adaptive way to merge the offline and online policies to generate the agent's action; (3) in the action-oriented module, we learn the agent's action trajectories from the dataset, mitigating the issue of actions deviating excessively during environmental exploration; (4) to maintain consistency in the agent's actions, we propose an action temporally-aligned representation learning method that preserves the trend of the agent's actions. This ensures that the agent's actions align with the learned trajectories and prevents significant deviations during exploration. Extensive experiments are conducted on 15 scenarios of the D4RL/MuJoCo benchmark. The results demonstrate that our method achieves the best performance in 12 scenarios and the second-best performance in 3 scenarios compared with state-of-the-art methods. The project's code can be found at https://github.com/FrankGod111/Policy-Expansion.git




Data availability and access

The datasets generated or analyzed during this study are available in the D4RL repository at https://github.com/rail-berkeley/d4rl.git and https://github.com/Farama-Foundation/D4RL. The code for this work is available at the URL provided in the abstract.

References

  1. Fu J, Kumar A, Nachum O, Tucker G, Levine S (2020) D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219

  2. McDonald MJ, Hadfield-Menell D (2022) Guided imitation of task and motion planning. Proceedings of the 5th conference on robot learning. 164:630–640

  3. Chen X, Yao L, McAuley J, Zhou G, Wang X (2023) Deep reinforcement learning in recommender systems: A survey and new perspectives. Knowl-Based Syst 264:110335


  4. Sharif Razavian A, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: An astounding baseline for recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) workshops

  5. Gupta A, Kumar V, Lynch C, Levine S, Hausman K (2020) Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. Proceedings of the Conference on Robot Learning. 100:1025–1037


  6. Fujimoto S, Meger D, Precup D (2019) Off-policy deep reinforcement learning without exploration. Proceedings of the 36th international conference on machine learning. 97:2052–2062

  7. Schwarzer M, Rajkumar N, Noukhovitch M, Anand A, Charlin L, Hjelm RD, Bachman P, Courville AC (2021) Pretraining representations for data-efficient reinforcement learning. Adv Neural Inf Process Syst 34:12686–12699


  8. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the North American chapter of the association for computational linguistics, pp 4171–4186

  9. Campos V, Sprechmann P, Hansen S, Barreto A, Kapturowski S, Vitvitskyi A, Badia AP, Blundell C (2021) Beyond fine-tuning: Transferring behavior in reinforcement learning. Proceedings of the international conference on machine learning 2021 workshop on unsupervised reinforcement learning

  10. Nair A, Gupta A, Dalal M, Levine S (2020) Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359

  11. Nakamoto M, Zhai S, Singh A, Sobol Mark M, Ma Y, Finn C, Kumar A, Levine S (2023) Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning. Adv Neural Inf Process Syst 36:62244–62269


  12. Kumar A, Fu J, Soh M, Tucker G, Levine S (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. Adv Neural Inform Process Syst 32

  13. Zhang H, Xu W, Yu H (2023) Policy expansion for bridging offline-to-online reinforcement learning. Proceedings of the eleventh international conference on learning representations

  14. Seo Y, Lee K, James SL, Abbeel P (2022) Reinforcement learning with action-free pre-training from videos. Proceedings of the 39th international conference on machine learning, 162:19561–19579

  15. Son S, Zheng L, Sullivan R, Qiao Y-L, Lin M (2023) Gradient informed proximal policy optimization. Adv Neural Inf Process Syst 36:8788–8814


  16. Fujimoto S, Chang W-D, Smith E, Gu SS, Precup D, Meger D (2023) For sale: State-action representation learning for deep reinforcement learning. Adv Neural Inf Process Syst 36:61573–61624


  17. Bhatt A, Palenicek D, Belousov B, Argus M, Amiranashvili A, Brox T, Peters J (2024) Crossq: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. Proceedings of the international conference on learning representations (ICLR)

  18. Li R, Shang Z, Zheng C, Li H, Liang Q, Cui Y (2023) Efficient distributional reinforcement learning with kullback-leibler divergence regularization. Appl Intell 53(21):24847–24863


  19. Shang Z, Li R, Zheng C, Li H, Cui Y (2023) Relative entropy regularized sample-efficient reinforcement learning with continuous actions. IEEE Trans Neural Netw Learn Syst pp 1–11

  20. Hiraoka T, Imagawa T, Hashimoto T, Onishi T, Tsuruoka Y (2022) Dropout q-functions for doubly efficient reinforcement learning. Proceedings of the tenth international conference on learning representations, ICLR 2022, Virtual Event, April 25–29 2022

  21. Zhao X, Ding S, An Y, Jia W (2019) Applications of asynchronous deep reinforcement learning based on dynamic updating weights. Appl Intell 49(2):581–591


  22. Ding S, Zhao X, Xu X, Sun T, Jia W (2019) An effective asynchronous framework for small scale reinforcement learning problems. Appl Intell 49(12):4303–4318


  23. Du X, Chen H, Wang C, Xing Y, Yang J, Yu PS, Chang Y, He L (2024) Robust multi-agent reinforcement learning via bayesian distributional value estimation. Pattern Recogn 145:109917


  24. Ciosek K, Vuong Q, Loftin R, Hofmann K (2019) Better exploration with optimistic actor critic. Adv Neural Inf Process Syst 32:103368

  25. Wu J, Wu H, Qiu Z, Wang J, Long M (2022) Supported policy optimization for offline reinforcement learning. Adv Neural Inf Process Syst 35:31278–31291


  26. Fujimoto S, Gu SS (2021) A minimalist approach to offline reinforcement learning. Adv Neural Inf Process Syst 34:20132–20145


  27. Kumar A, Zhou A, Tucker G, Levine S (2020) Conservative q-learning for offline reinforcement learning. Adv Neural Inf Process Syst 33:1179–1191


  28. Kostrikov I, Fergus R, Tompson J, Nachum O (2021) Offline reinforcement learning with fisher divergence critic regularization. Proceedings of the 38th international conference on machine learning, 139:5774–5783

  29. Kidambi R, Rajeswaran A, Netrapalli P, Joachims T (2020) Morel: model-based offline reinforcement learning. Proceedings of the 34th international conference on neural information processing systems

  30. Yu T, Thomas G, Yu L, Ermon S, Zou JY, Levine S, Finn C, Ma T (2020) Mopo: Model-based offline policy optimization. Adv Neural Inf Process Syst 33:14129–14142


  31. Liu H, Abbeel P (2021) Behavior from the void: Unsupervised active pre-training. Adv Neural Inf Process Syst 34:18459–18473


  32. Wu J, Wu H, Qiu Z, Wang J, Long M (2022) Supported policy optimization for offline reinforcement learning. Adv Neural Inf Process Syst 35:31278–31291


  33. Lee S, Seo Y, Lee K, Abbeel P, Shin J (2022) Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. Proceedings of the 5th conference on robot learning, 164:1702–1712

  34. Yang M, Nachum O (2021) Representation matters: Offline pretraining for sequential decision making. Proceedings of the 38th international conference on machine learning. 139:11784–11794

  35. Uchendu I, Xiao T, Lu Y, Zhu B, Yan M, Simon J, Bennice M, Fu C, Ma C, Jiao J, Levine S, Hausman K (2023) Jump-start reinforcement learning. Proceedings of the 40th international conference on machine learning, 202:34556–34583

  36. Kostrikov I, Nair A, Levine S (2021) Offline reinforcement learning with implicit q-learning. Advances in deep reinforcement learning workshop conference on neural information processing systems

  37. Sundhar Ramesh S, Giuseppe Sessa P, Hu Y, Krause A, Bogunovic I (2024) Distributionally robust model-based reinforcement learning with large state spaces. Proceedings of the 27th international conference on artificial intelligence and statistics, 238:100–108

  38. Guo S, Zou L, Chen H, Qu B, Chi H, Yu PS, Chang Y (2024) Sample efficient offline-to-online reinforcement learning. IEEE Trans Knowl Data Eng 36(3):1299–1310


  39. Li P, Tang H, Yang T, Hao X, Sang T, Zheng Y, Hao J, Taylor ME, Tao W, Wang Z (2022) PMIC: Improving multi-agent reinforcement learning with progressive mutual information collaboration. Proceedings of the 39th international conference on machine learning, 162:12979–12997

  40. Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th international conference on machine learning, 80:1861–1870


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 62462031, in part by the Natural Science Foundation of Jiangxi Province under Grant 20232BAB202018.

Author information


Contributions

Xuebo Cheng and Xiaohui Huang were involved in conceptualization, data curation, methodology, software, validation, writing original draft. Zhichao Huang helped in conceptualization, writing, supervision and review. Nan Jiang helped in data curation, supervision.

Corresponding author

Correspondence to Xiaohui Huang.

Ethics declarations

Ethical and informed consent for data used

Relevant ethical guidelines and regulations were followed in conducting this study.

Competing Interests

The authors declare that there are no potential competing interests or conflicting relationships in this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Algorithm

Algorithm 1 MAERL: Multi-Action Evaluation with Policy Extension

In this section, we provide the pseudo-code of MAERL in Algorithm 1. The framework is divided into two phases: offline training and online training. During the offline training phase, the model randomly samples data from the offline dataset (\(D_{offline}\)) for training, optimizing both Q-values and policies; at the same time, the action-oriented module uses random samples to learn the agent's implicit policy. In the online training phase, the policy learned in the offline phase is retained while an additional online policy is trained. The information from both policies is fused using mutual information to obtain the final action \(a_{target}\) for interacting with the environment. The model's learning performance is enhanced through policy extension and multi-action evaluation, and the action-oriented process introduces implicit action guidance, which contributes to improved exploration. The central idea of the algorithm is to train a conservative policy during offline learning and then explore during online learning, with policy extension and multi-action evaluation used to integrate the two sources of knowledge.
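The following is a minimal sketch of this two-phase loop. All component names (offline_policy, online_policy, critic, action_module, and their methods) are hypothetical placeholders rather than the authors' actual API, and the critic-weighted action fusion below is a simplification of the mutual-information-based fusion used in MAERL.

```python
# Sketch of the two-phase MAERL-style training loop; interfaces are illustrative.
import copy
import random

def offline_phase(d_offline, offline_policy, critic, action_module, steps):
    """Offline phase: optimize Q-values and the offline policy on D_offline,
    while the action-oriented module learns the implicit action trajectories."""
    for _ in range(steps):
        batch = random.sample(d_offline, k=256)   # random draws from the dataset
        critic.update(batch)                      # conservative Q-value update
        offline_policy.update(batch, critic)      # constrained policy improvement
        action_module.update(batch)               # learn implicit action guidance
    return offline_policy, critic, action_module

def online_phase(env, offline_policy, critic, action_module, steps):
    """Online phase: keep the offline policy, train an additional online policy,
    and fuse the two action proposals into a_target for interaction."""
    online_policy = copy.deepcopy(offline_policy)  # extend from the offline policy
    replay, state = [], env.reset()
    for _ in range(steps):
        a_off, a_on = offline_policy.act(state), online_policy.act(state)
        # Multi-action evaluation: pick the proposal with the higher Q-estimate
        a_target = a_off if critic.q(state, a_off) >= critic.q(state, a_on) else a_on
        next_state, reward, done, _ = env.step(a_target)
        replay.append((state, a_target, reward, next_state, done))
        critic.update(random.sample(replay, k=min(256, len(replay))))
        online_policy.update(replay, critic, guidance=action_module)
        state = env.reset() if done else next_state
    return online_policy
```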

When IQL is used for offline training followed by SAC for online training, an important question arises: can the Q-function trained offline be used directly in online SAC training? The answer is yes. According to many offline-to-online studies, such as [10, 11, 13, 31], as well as the experiments in this paper, it is beneficial to continue using the pretrained Q-function during online training rather than starting from scratch. Conflicts may nevertheless arise from the different update targets: as noted in Cal-QL [11], an excessively conservative Q-function can lead to sub-optimal actions during online training and exacerbate the performance drop. This paper further observes that the policy recovery process coincides with the Q-function returning to normal, i.e., it is a recovery from the performance drop. Mitigating this phenomenon makes the reuse of the pretrained Q-function parameters all the more valuable.
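As a concrete illustration of carrying the pretrained Q-function into online SAC fine-tuning, the sketch below loads offline critic weights before online training begins. The checkpoint path and the assumption that the offline and online critics share the same architecture are ours, for illustration only.

```python
# Sketch: reuse the offline-pretrained critic instead of re-initialising it.
import copy
import torch

def init_online_critic_from_offline(online_critic: torch.nn.Module,
                                    offline_ckpt_path: str) -> torch.nn.Module:
    """Load offline (e.g. IQL) critic weights into the online SAC critic and build
    a matching target network, so online TD updates start from the pretrained
    value estimate rather than from scratch."""
    state_dict = torch.load(offline_ckpt_path, map_location="cpu")
    online_critic.load_state_dict(state_dict)      # architectures must match
    target_critic = copy.deepcopy(online_critic)   # target starts identical
    for p in target_critic.parameters():
        p.requires_grad_(False)                    # updated only by Polyak averaging
    return target_critic
```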

Appendix B: Baseline implementations and experimental details

1.1 B.1 Parameters setting and hyper-parametric studies

The parameter settings for the comparison algorithms (including baselines) mentioned in the experimental part of this paper are shown in Tables 5, 6, 7, 8, 9, and 10 below. Specifically, for the MuJoCo tasks in SPOT's experiments, we follow previous work and omit the evaluation of the medium-expert and expert datasets, where the offline pre-trained agents already achieve expert-level performance without further fine-tuning. The experiments are repeated with 5 different random seeds; the authors provide checkpoints for the pre-trained agents in the source code, so we directly employ these checkpoints for online tuning. In our experiments, we normalized the states in the MuJoCo datasets but not in the AntMaze datasets.

Table 5 Hyper-parameters of SAC
Table 6 Hyper-parameters of CQL
Table 7 Hyper-parameters of Cal-QL

It is important to mention that, unlike SPOT, the related approach SUNG uses its VAE to estimate the density of state-action pairs rather than of the behavioral policy; we therefore set the latent dimension to 2 \(\times \) (state dim + action dim) instead of 2 \(\times \) action dim. For online tuning, we use the same experimental environment and parameter settings for all baselines. For baseline reproduction, we strictly follow the official implementations and the hyper-parameters reported in the original papers to fine-tune the TD3+BC, CQL, and SPOT agents pre-trained on the MuJoCo and AntMaze datasets. For fairness, the reintegration methods were removed from the BT (behavior transfer) and BC (behavior cloning) baselines described below.

Table 8 Hyper-parameters of IQL

1.2 B.2 Combining behavior cloning and online reinforcement learning

Behavior Cloning (BC) combines the advantages of supervised learning and reinforcement learning: a policy is first learned from expert demonstrations with supervised learning and then further improved by online fine-tuning to adapt to the environment. Given sequences of actions taken to perform a task, the BC part learns which action to take in a given state by minimizing the difference between the learned policy and the expert's; in the online part, the model interacts with the real environment and improves the policy based on actual reward signals. The goal of the decision-maker is to find a stationary policy \(\pi : S\rightarrow \Delta (A)\) that maximizes the cumulative reward:

$$\begin{aligned} V(\pi )=\mathbb {E} \left[ \sum _{t=0}^{\infty } \gamma ^{t}R(s_{t},a_{t})\mid s_{0}\sim \rho ,\,a_{t}\sim \pi (\cdot \mid s_{t}),\,\forall t\ge 0\right] . \end{aligned}$$
(B1)
Table 9 Hyper-parameters of AWAC
Table 10 Hyper-parameters of TD3+BC

Suppose there is a dataset \(D=\left\{ (s_{i},a_{i})\right\} _{i=1}^{m}\) collected by an expert policy \(\pi _{E}\), where each state-action pair is generated by the interaction of \(\pi _{E}\) with the environment. Since the expert policy is assumed to be of high quality, the goal becomes finding a policy \(\pi \) that minimizes the difference in value relative to the expert policy:

$$\begin{aligned} \min _{\pi } [V(\pi _{E})-V(\pi )]. \end{aligned}$$
(B2)

This is typically achieved by minimizing the KL divergence to the expert policy over expert-visited states:

$$\begin{aligned} \min _{\pi } \mathbb {E} _{s\sim d_{\pi _{E}}}\left[ D_{KL}\big (\pi _{E}(\cdot \mid s),\pi (\cdot \mid s)\big )\right] :=\mathbb {E}_{(s,a)\sim \rho _{\pi _{E}}}\left[ \log {\frac{\pi _{E}(a\mid s)}{\pi (a \mid s)} } \right] . \end{aligned}$$
(B3)

Here \(d_{\pi _{E}}\) and \(\rho _{\pi _{E}}\) denote the (discounted) state distribution and state-action distribution induced by the policy \(\pi _{E}\), respectively, defined in the usual way:

$$\begin{aligned} d_{\pi }(s)= \mathbb {E} \left[ \sum _{t=0}^{\infty } \gamma ^{t}\textrm{Pr}(s_{t}=s)\mid s_{0}\sim \rho ,\,a_{t}\sim \pi (\cdot \mid s_{t}),\,\forall t\ge 0 \right] , \end{aligned}$$
(B4)
$$\begin{aligned} \rho _{\pi }(s,a)= \mathbb {E} \left[ \sum _{t=0}^{\infty } \gamma ^{t}\textrm{Pr}(s_{t}=s,a_{t}=a)\mid s_{0}\sim \rho ,\,a_{t}\sim \pi (\cdot \mid s_{t}),\,\forall t\ge 0 \right] . \end{aligned}$$
(B5)
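To make the objective (B3) concrete, the following is a minimal PyTorch sketch: for a Gaussian policy, minimizing the KL divergence to the (unknown) expert policy over expert states reduces to maximizing the log-likelihood of the expert actions, since the \(\pi_E\) term in (B3) is constant in \(\pi\). Network sizes and tensor shapes are illustrative.

```python
# Minimal behavior-cloning loss corresponding to (B3).
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, states: torch.Tensor) -> torch.distributions.Normal:
        h = self.net(states)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

def bc_loss(policy: GaussianPolicy, states: torch.Tensor,
            expert_actions: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of expert actions under the current policy."""
    return -policy(states).log_prob(expert_actions).sum(dim=-1).mean()
```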

1.3 B.3 Combining Behavior Transfer and Online Reinforcement Learning

Behavior Transfer (BT) is an approach, originally aimed at machine learning methods that use unsupervised or self-supervised learning, in which pre-training is combined with reinforcement learning. The idea is to transfer learned behaviors to later learning stages by applying the policy learned in a source environment to the target environment, although fine-tuning is often required because of environmental differences. The training process is illustrated in Fig. 14. A policy-alignment or value-function-transfer approach can be used to find the method best suited to the target environment. The composition of the BT data is depicted in Fig. 13, where N denotes the total number of trajectories, \(N_{S}\) the number of trajectories from the sub-optimal policy, \(N_{E}\) the number of trajectories from the expert policy, and \(\eta \) the balancing factor:

$$\begin{aligned} N_{S}=(1-\eta )N,\quad N_{E}=\eta N. \end{aligned}$$
(B6)

This BT approach was also originally used in discrete-action settings. Its value function can be expressed as:

$$\begin{aligned} V(\pi )=\mathbb {E} \left[ \sum _{h=1}^{H} r(s_{h},a_{h})\mid P,\pi \right] , \end{aligned}$$
(B7)

where \(H\) is the maximum length of a trajectory.

Here, we apply this concept to establish a baseline within our specific context, drawing inspiration from the approach outlined by Campos et al. (2021). Following the methodology presented in their work, we utilize a Zeta distribution characterized by parameter \(a = 2\) to determine the duration of persistent unrolling steps using the offline policy. The number of persistent unroll steps is subject to resampling when not actively engaged in a persistent unroll phase, and this resampling occurs when a randomly sampled number from the interval [0, 1] is less than a predefined threshold \(\epsilon \). For our experiments, we set \(\epsilon \) to 0.1.
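The sketch below shows one way to implement this rollout schedule: the length of a persistent offline-policy unroll is drawn from a Zeta distribution with \(a = 2\), and a new unroll is triggered with probability \(\epsilon = 0.1\) when not already unrolling. This is our reading of the procedure, not the authors' exact implementation.

```python
# Persistent-unroll schedule with Zeta-distributed unroll lengths.
import numpy as np

rng = np.random.default_rng(0)
EPS, A = 0.1, 2.0        # resampling threshold and Zeta parameter from the text
remaining_unroll = 0

def choose_policy() -> str:
    """Return 'offline' while inside a persistent unroll, otherwise 'online',
    occasionally starting a new unroll of Zeta-distributed length."""
    global remaining_unroll
    if remaining_unroll > 0:
        remaining_unroll -= 1
        return "offline"
    if rng.uniform() < EPS:                 # start a new persistent unroll
        remaining_unroll = rng.zipf(A) - 1  # Zeta(a=2) sample; current step counts
        return "offline"
    return "online"
```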

This adaptation enables us to establish a robust baseline in our experimental framework, closely aligning with the principles introduced by Campos et al. in 2021. It provides a systematic approach to determining the duration of persistent unrolling steps and introduces a level of randomness to ensure adaptability during the learning process. A comparison of BC, BT, and fine-tuning experiments based on these approaches is depicted in Fig. 3. The mean normalized scores after learning the policy on two tasks with complementary datasets under different conditions are shown. A higher score indicates better performance. The experimental results indicate that (1) in the noisy expert condition, the performance of MAERL is roughly equivalent to the results of BC+Online but slightly better than the BT+Online approach, and (2) overall, the MAERL algorithm outperforms both BC+Online and BT+Online approaches in tasks under both types of conditions.

1.4 B.4 Parameter description for offline-to-online reinforcement learning

The method aims to improve offline-to-online RL performance and consists of two components:

  • training an ensemble of critic and policy networks in the offline training phase, whereas the traditional offline approach learns a policy network and a critic network;

  • balanced replay: balancing the offline-online replay scheme to make an appropriate trade-off between the use of offline and online samples.

We used the code published by the authors. The specific details are as follows:

  • Training details for offline RL. For the network architecture, we use a 2-layer multi-layer perceptron (MLP) for the value and policy networks (but in the halfcheetah-medium and halfcheetah-medium-replay scenarios we find that a 3-layer MLP is more effective for training agents). For each experimental scenario, we follow the setup of Kumar et al. while ensuring fairness.

  • Training details for online RL. The Adam optimizer is used; the policy learning rate is chosen from {3e-4, 3e-5, 3e-6} and the value learning rate is 3e-4.

  • Training details for balanced replay. For the density-ratio estimation network, we fix a 2-layer MLP in view of the actual experimental setting. We used a batch size of 256 (i.e., 256 offline samples and 256 online samples) and a learning rate of 3e-4 for all locomotion experiments; a minimal sketch of such a balanced sampler is given after this list.
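The sketch below shows one way to realise the balanced-replay trade-off described above: each update draws 256 offline and 256 online transitions, and offline samples are weighted by an estimated online/offline density ratio \(w(s,a)\) produced by the 2-layer MLP ratio network (trained separately, not shown). The buffer and network interfaces are illustrative assumptions, not the published code.

```python
# Balanced sampling of offline and online transitions with density-ratio weights.
import torch

def sample_balanced(offline_buffer, online_buffer, ratio_net, batch_size: int = 256):
    """Return a mixed batch plus per-sample weights for the critic update."""
    off = offline_buffer.sample(batch_size)   # dicts of tensors: s, a, r, s2, done
    on = online_buffer.sample(batch_size)
    with torch.no_grad():
        # density-ratio estimate w(s, a) ~ d_online / d_offline for offline samples
        w_off = ratio_net(off["s"], off["a"]).view(-1).clamp(min=1e-3)
    w_on = torch.ones(batch_size)             # online samples keep unit weight
    batch = {k: torch.cat([off[k], on[k]]) for k in off}
    weights = torch.cat([w_off, w_on])
    return batch, weights / weights.mean()    # normalise for training stability
```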

1.5 B.5 Distribution shift of offline samples and online samples

We present a comparison between the log-likelihood estimation of the distribution of offline samples during model training and the log-likelihood estimation of online samples collected by the offline RL agent. The divergence or convergence of log-likelihood estimates between offline and online samples becomes crucial. From Fig. 12, we observe that the model captures differences in the distribution of the two datasets, revealing its effectiveness in handling the transition from offline to online learning. By examining the log-likelihood estimation of data distribution in offline and online environments and its consistency with the target distribution, we gain insights into the model’s ability to ensure that the learned policy performs well not only on offline data but also generalizes effectively to dynamic and potentially different distributions encountered during online interactions.
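An illustrative version of this diagnostic is sketched below: a simple density model (a diagonal Gaussian here, as a stand-in for the model actually used) is fitted to the offline samples, and the mean log-likelihoods of offline and online samples are compared; a large gap signals distribution shift.

```python
# Compare log-likelihoods of offline vs. online samples under an offline-fitted density.
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood_shift(offline_x: np.ndarray, online_x: np.ndarray):
    """Return (mean log-lik of offline samples, mean log-lik of online samples)
    under a diagonal Gaussian fitted to the offline data."""
    mu = offline_x.mean(axis=0)
    var = offline_x.var(axis=0) + 1e-6               # regularised diagonal covariance
    density = multivariate_normal(mean=mu, cov=np.diag(var))
    return density.logpdf(offline_x).mean(), density.logpdf(online_x).mean()
```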

Fig. 12 Log-likelihood estimates of the offline samples used to train the model and of the online samples collected by the offline RL agent

Fig. 13 Conceptual representation of the composition of Behavior Transfer (BT) data

Fig. 14 Conceptual illustration of the training process for the Behavior Transfer (BT) method

1.6 B.6 Comparative analysis with CQL-fine tuning

The CQL algorithm aims to enhance sample efficiency and stability by introducing an additional constraint to limit the upper bound of Q-value estimates, thereby improving Q-learning. In other words, it corrects the estimation of Q-values in a conservative manner to enhance the robustness of training. However, when fine-tuning algorithms utilize CQL to update the Q-function with online data, erroneous peaks may occur on suboptimal actions (x-axis). A specific diagram is shown in Fig. 15. Such a process can lead to the deviation of the policy from high-reward actions covered by the dataset, favoring incorrect new actions and resulting in the degradation of the pretrained policy.

In contrast, MAERL (Multi-action Evaluation with Policy Extension, proposed in this paper) incorporates a multi-action evaluation method when estimating the Q-function. This ensures both a conservative estimate of the Q-values and coverage of all actions by the policy. During fine-tuning, it avoids missing the optimal values and, to some extent, accelerates exploration of new states.

Fig. 15 Comparison between the learned Q-value and the true value for a given state, where (a) and (b) show the fine-tuning of CQL and of MAERL, respectively. The figure visualizes a slice of the learned Q-function versus the ground-truth values for a given state

Fig. 16 Changes in average Q-values and score evaluations during offline pre-training and online fine-tuning

1.7 B.7 Average Q-value and score evaluation

In this section, we visualize the changes in average Q-values and score evaluation during the model’s offline pre-training and online fine-tuning processes, as shown in Fig. 16. The fine-tuning starts at 10K steps, with the red portion indicating the performance recovery period, coinciding with the Q-value adjustment phase.

During the offline learning phase, the Q-values produced by CQL or IQL are lower than the true values. This is attributable to mechanisms such as conservative estimation or importance weighting, which are designed to yield conservative estimates on the offline dataset. Once the agent starts interacting with online data, regions affected by extrapolation error may appear more attractive than those covered by the highly conservative offline Q-function. Extrapolation error refers to the estimation error in regions outside the known offline dataset. In other words, the Q-function may seem to perform better in certain scenarios even though it underestimates Q-values on the offline data. This can place a misleading emphasis on optimizing the policy for higher returns in regions with larger extrapolation errors, and the algorithm may then neglect the initial policy because it prioritizes behavior that appears to perform well in those specific regions.

1.8 B.8 Analysis of why exploring overvalued actions fails when fine-tuning CQL

In an experiment on CQL fine-tuning, we found that the agent selects actions with higher estimated value (Q-value) during exploration, and that performance deteriorates when MAERL is combined with CQL. We now analyze the results of this experiment. First, we review the actual implementation of CQL policy evaluation:

$$\begin{aligned} \arg \min _{Q}\; \mathbb {E} _{s\sim \mathcal {D} }\left[ \log \sum _{a} \exp \big (Q(s,a)\big )-\mathbb {E}_{a\sim \pi _{\beta }(a\mid s) }\left[ Q(s,a) \right] \right] +\mathbb {E}_{(s,a,r,s^{'})\sim \mathcal {D} } \left[ \big (Q(s,a)-y\big )^{2} \right] , \end{aligned}$$
(B8)

where \(\pi _{\beta }\) is the behavioral policy of the offline dataset and y is the standard TD target for policy evaluation. There are three terms in (B8): given a state, the first minimizes the Q-values of all actions; the second maximizes the Q-values of the actions collected in the dataset; and the third is standard TD learning. Therefore, given a state s, when we select a higher-valued action \(a_{h}\) to explore, the transition \((s,a_{h},s^{'},r)\) is collected into the replay buffer. During subsequent training, combining the first and second terms of (B8) leads to a situation in which most unseen actions receive low values (the first term makes a conservative estimate for all Q-values) while only the few seen actions receive high values (the second term pushes their values up). The third term, which involves bootstrapping, then updates high-value actions toward lower values, so the values of all actions are consistently reduced at each gradient step. This leads to a breakdown of the Q-value function, which in turn produces poor policies through propagation in policy improvement. In MAERL, this problem is addressed by a multi-action evaluation approach that takes both high-value and low-value actions into account, thereby mitigating the instability of value estimation and the preference for high Q-values during exploration.
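To make the three-term structure of (B8) concrete, the sketch below transcribes it for a discrete action space. The discrete-action setting, the tensor shapes, and the \(\alpha\) weight on the conservative term are our assumptions for illustration, not the paper's configuration.

```python
# CQL policy-evaluation loss (B8) for a discrete-action Q-network.
import torch
import torch.nn.functional as F

def cql_critic_loss(q_net, target_q_net, batch, gamma: float = 0.99, alpha: float = 1.0):
    s, a, r, s2, done = batch                  # a: LongTensor of dataset actions, shape (B,)
    q_all = q_net(s)                           # Q(s, .) for all actions, shape (B, |A|)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) on dataset actions
    # term 1 - term 2: push down logsumexp over all actions, push up dataset actions
    conservative = torch.logsumexp(q_all, dim=1) - q_data
    # term 3: standard TD error towards y = r + gamma * max_a' Q_target(s', a')
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_q_net(s2).max(dim=1).values
    return alpha * conservative.mean() + F.mse_loss(q_data, y)
```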

Appendix C: Convergence analysis

In this section, we provide proofs of some of the formulas. Symbols that are not explicitly explained carry their usual meanings.

In the setting of offline and online interaction, both learning and training are based on Q-learning, so it suffices to show that the Q-functions of the two policies converge to a fixed point. The convergence of a Q-function can be established by checking whether the corresponding Bellman optimality operator is a \(\gamma \)-contraction.

For the Bellman optimality operator \(\mathcal {H}\), consider a specific task \(z_{j} \in Z\) with prior probability \(P_{z_{j}}\). Performing action a in state s under task \(z_{j}\) yields reward \(r(s,a\mid z_{j})\) and transition probability \(p(s' \mid s,a,z_{j})\) from state s to the next state \(s'\). The following holds:

$$\begin{aligned} \mathcal {H}Q(a\mid s,z_{j})=r(s,a\mid z_{j})+\gamma \mathbb {E}_{s^{'}}\max _{a^{'}} Q(a^{'}\mid s^{'},z_{j}), \end{aligned}$$
(C9)

where \(\mathbb {E}_{s^{'}}\) is short for \(\mathbb {E}_{s^{'}\sim p(s' | s,a,z_{j})}\).

For \(z_{0}\), the Bellman optimality operator \(\tau \) applied to a state-action pair gives:

$$\begin{aligned} \tau Q(a\mid s,z_{j})=\mathbb {E}_{z_{0}}\,r(s,a\mid z_{j}) +\gamma \,\mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\max _{a^{'}} Q(a^{'}\mid s^{'},z_{0}), \end{aligned}$$
(C10)

where \(\mathbb {E}_{z_{j}}\) is short for \(\mathbb {E}_{z_{j}\sim p(z_{j})}\).

1.1 C.1 Convergence analysis of q-functions computation for different policies

For any two functions \(Q_{1}(a\mid s,z_{0})\) and \(Q_{2}(a\mid s,z_{0})\), the Bellman optimality operator \(\tau \) is a \(\gamma \)-contraction. Let \(Q^{\beta }(a\mid s)\) denote the offline Q-function and \(Q^{\theta }(a\mid s)\) the online Q-function; then the following inequality holds:

$$\begin{aligned} \left\| \tau Q^{\beta }_{1}(a\mid s)-\tau Q^{\beta }_{2}(a\mid s) \right\| _{\infty }\le \gamma \left\| Q^{\beta }_{1}(a\mid s)-Q^{\beta }_{2}(a\mid s) \right\| _{\infty } , \end{aligned}$$
(C11)

where \(\left\| \cdot \right\| _{\infty } \) is the sup-norm and \(\gamma \in (0,1)\). The same holds for \(Q^{\theta }(a\mid s)\).

$$\begin{aligned} & \left\| \tau Q^{\beta }_{1}(a\mid s) - \tau Q^{\beta }_{2}(a\mid s) \right\| _{\infty }\nonumber \\ & \quad = \max _{s,a}\Bigg | \mathbb {E}_{z_{j}}r(s,a\mid z_{j}) +\gamma \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\max _{a^{'}} Q_{1}^{\beta }(a^{'}\mid s^{'},z_{0}) -\mathbb {E}_{z_{j}}r(s,a\mid z_{j}) -\gamma \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\max _{a^{'}} Q_{2}^{\beta }(a^{'}\mid s^{'},z_{0})\Bigg |\nonumber \\ & \quad = \gamma \max _{s,a}\left| \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\max _{a^{'}} Q_{1}^{\beta }(a^{'}\mid s^{'},z_{0}) - \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\max _{a^{'}} Q_{2}^{\beta }(a^{'}\mid s^{'},z_{0})\right| \nonumber \\ & \quad = \gamma \max _{s,a}\left| \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\left[ \max _{a^{'}} Q_{1}^{\beta }(a^{'}\mid s^{'},z_{0}) - \max _{a^{'}} Q_{2}^{\beta }(a^{'}\mid s^{'},z_{0})\right] \right| \nonumber \\ & \quad \le \gamma \max _{s,a}\, \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}} \left| \max _{a^{'}} Q_{1}^{\beta }(a^{'}\mid s^{'},z_{0}) - \max _{a^{'}} Q_{2}^{\beta }(a^{'}\mid s^{'},z_{0})\right| \nonumber \\ & \quad \le \gamma \max _{s,a}\, \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}} \left| \max _{a^{'}\in A,\,s^{'}\in S} Q_{1}^{\beta }(a^{'}\mid s^{'},z_{0}) - \max _{a^{'}\in A,\,s^{'}\in S} Q_{2}^{\beta }(a^{'}\mid s^{'},z_{0})\right| . \end{aligned}$$
(C12)

Express the above equations in normalized form:

$$\begin{aligned} & \left\| \tau Q^{\beta }_{1}(a\mid s)-\tau Q^{\beta }_{2}(a\mid s) \right\| _{\infty }\nonumber \\\le & \gamma \max _{a, s} \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}} \left| \max Q_{1}^{\beta }(a\mid s) - \max Q_{2}^{\beta }(a\mid s)\right| \nonumber \\\le & \gamma \max _{a, s} \mathbb {E}_{z_{j}}\mathbb {E} \left\| Q_{1}^{\beta }(a\mid s) - Q_{2}^{\beta }(a\mid s) \right\| _{\infty }\nonumber \\= & \gamma \left\| Q_{1}^{\beta }(a\mid s) - Q_{2}^{\beta }(a\mid s) \right\| _{\infty }. \end{aligned}$$
(C13)

Suppose that \(Q^{*} = Q^{*}(a \mid s)\) is the optimal target Q-value of the policy, so that \(Q^{*} = \tau Q^{*}\) holds. The following holds by iteration:

$$\begin{aligned} 0\le \left\| Q_{k+1}^{*} -Q_{k}^{*}\right\| _{\infty }\le \gamma \left\| Q_{k}^{*} \!-\!Q_{k-1}^{*}\right\| _{\infty }\le \cdots \le \gamma ^{k+1}\left\| Q_{1}^{*} -Q_{0}^{*}\right\| _{\infty } ; \end{aligned}$$
(C14)
$$\begin{aligned} \lim _{k \rightarrow \infty } \gamma ^{k+1}\left\| Q_{1}^{*} -Q_{0}^{*}\right\| =0, \gamma \in (0,1) . \end{aligned}$$
(C15)

Therefore, for any task, the value of each state-action pair \((s,a)\) converges stably to the optimal fixed point.

1.2 C.2 Latent policy optimization problem

The discussion in this section concerns the offline policy \(\pi _{\beta }\), the online policy \(\pi _{\theta }\), and the sampled latent policy \(\pi _{latent}\). Suppose that:

$$\begin{aligned} \arg \max _{a} Q^{*}(a\mid s)=\arg \max _{a}Q^{\pi }(a\mid s). \end{aligned}$$
(C16)

When the above equation holds, the following also holds:

$$\begin{aligned} \arg \max _{a}Q^{\pi }(a\mid s)=\arg \max _{a}Q^{\pi \rightarrow \left\langle {\pi ,Z_{o}}\right\rangle }(a\mid s). \end{aligned}$$
(C17)

The initialized latent policy may not be optimal. During the fine-tuning process, we have the opportunity to understand potential relationships within state-action pairs. Subsequently, a combination share module is employed to map potential actions to the actual action space, incorporating the learned mappings to exert a priori influence on subsequent actions. The potential action orientation, associated with each customized action, facilitates the selection of potential actions within the constrained action space, offering a certain degree of flexibility.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Cheng, X., Huang, X., Huang, Z. et al. An offline-to-online reinforcement learning approach based on multi-action evaluation with policy extension. Appl Intell 54, 12246–12271 (2024). https://doi.org/10.1007/s10489-024-05806-2
