BiES: Adaptive Policy Optimization for Model-Based Offline Reinforcement Learning

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13151)

Abstract

Offline reinforcement learning (RL) aims to train an agent solely from a dataset of historical interactions with the environment, without any further costly or dangerous active exploration. Model-based RL (MbRL) usually achieves promising performance in offline RL owing to its high sample efficiency and compact modeling of a dynamic environment. However, it may suffer from bias and error accumulation in the model predictions. Existing methods address this problem by adding a penalty term to the model reward, but they require careful hand-tuning of the penalty and its weight. In this paper, we instead formulate model-based offline RL as a bi-objective optimization problem in which the first objective maximizes the model return and the second objective adapts to the learning dynamics of the RL policy. As a result, we do not need to tune the penalty and its weight, yet can achieve a more advantageous trade-off between the final model return and the model’s uncertainty. To solve the bi-objective optimization problem, we develop an efficient and adaptive policy optimization algorithm equipped with an evolution strategy, named BiES. Experimental results on the D4RL benchmark show that our approach sets a new state of the art and significantly outperforms existing offline RL methods on long-horizon tasks.
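The general idea of optimizing two objectives with an evolution strategy, without a hand-tuned penalty weight, can be illustrated with a minimal Python sketch. This is not the BiES algorithm itself: the abstract does not state how the two objectives are combined, so the sketch assumes a min-norm (MGDA-style) weighting of two evolution-strategy gradient estimates. The toy objectives `model_return` and `neg_uncertainty`, the targets `A` and `B`, and all hyperparameters are hypothetical stand-ins chosen only for illustration.

```python
import numpy as np

def es_gradient(f, theta, rng, sigma=0.1, n_pairs=32):
    """Antithetic evolution-strategy estimate of the gradient of f at theta."""
    eps = rng.standard_normal((n_pairs, theta.size))
    diffs = np.array([f(theta + sigma * e) - f(theta - sigma * e) for e in eps])
    return (diffs[:, None] * eps).sum(axis=0) / (2.0 * sigma * n_pairs)

def min_norm_weight(g1, g2):
    """Weight alpha in [0, 1] minimizing ||alpha * g1 + (1 - alpha) * g2||."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom < 1e-12:  # gradients (almost) identical: any weight works
        return 0.5
    return float(np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0))

def optimize(f1, f2, theta0, steps=200, lr=0.05, seed=0):
    """Ascend both objectives; the weight is recomputed at every step."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        g1 = es_gradient(f1, theta, rng)
        g2 = es_gradient(f2, theta, rng)
        alpha = min_norm_weight(g1, g2)  # adaptive, not hand-tuned
        theta += lr * (alpha * g1 + (1.0 - alpha) * g2)
    return theta

# Hypothetical toy objectives (assumptions, not from the paper):
# each is maximized at a different point, so they conflict.
A = np.array([1.0, 0.0])
B = np.array([0.0, 1.0])

def model_return(theta):
    return -float(np.sum((theta - A) ** 2))

def neg_uncertainty(theta):
    return -float(np.sum((theta - B) ** 2))

if __name__ == "__main__":
    theta = optimize(model_return, neg_uncertainty, [3.0, 3.0])
    print(theta, model_return(theta), neg_uncertainty(theta))
```

In this sketch the weight between the two objectives is recomputed from the current gradient estimates at every step, which is one simple way to avoid fixing a penalty coefficient in advance. In a real offline RL setting, the two functions would be replaced by rollouts of the policy in the learned dynamics model.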

This work is partially supported by the Shenzhen Fundamental Research Program under Grant No. JCYJ20200109141235597.

Notes

  1. https://github.com/aravindr93/mjrl/issues/35

Corresponding author

Correspondence to Yuhui Shi.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, Y., Jiang, J., Wang, Z., Duan, Q., Shi, Y. (2022). BiES: Adaptive Policy Optimization for Model-Based Offline Reinforcement Learning. In: Long, G., Yu, X., Wang, S. (eds) AI 2021: Advances in Artificial Intelligence. AI 2022. Lecture Notes in Computer Science (LNAI), vol 13151. Springer, Cham. https://doi.org/10.1007/978-3-030-97546-3_46

  • DOI: https://doi.org/10.1007/978-3-030-97546-3_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-97545-6

  • Online ISBN: 978-3-030-97546-3

  • eBook Packages: Computer Science, Computer Science (R0)