
LIDAR: learning from imperfect demonstrations with advantage rectification

  • Research Article
  • Published:
Frontiers of Computer Science

Abstract

In actor-critic reinforcement learning (RL) algorithms, function estimation errors are known to cause ineffective random exploration at the beginning of training and to lead to overestimated values and suboptimal policies. In this paper, we address this problem by performing advantage rectification with imperfect demonstrations, thereby reducing the function estimation errors. Pretraining with expert demonstrations has been widely adopted to accelerate deep reinforcement learning when simulations are expensive to obtain. However, existing methods such as behavior cloning often rely on additional assumptions or labels about the quality of the demonstrations, for example that they are optimal, which is usually unrealistic in the real world. In this paper, we explicitly handle imperfect demonstrations within actor-critic RL frameworks and propose a new method called learning from imperfect demonstrations with advantage rectification (LIDAR). LIDAR uses a rectified loss function to learn only from selected demonstrations, derived from the minimal assumption that the demonstrating policies perform better than the current policy. LIDAR learns from the contradictions caused by estimation errors and in turn reduces those errors. We apply LIDAR to three popular actor-critic algorithms, DDPG, TD3 and SAC, and experiments show that our method observably reduces function estimation errors, effectively leverages demonstrations that are far from optimal, and consistently outperforms state-of-the-art baselines in all scenarios.
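
The rectified loss itself is defined in the body of the paper; purely as an illustration of the idea sketched above, the following PyTorch snippet shows one plausible hinge-style form of such a term, added to the critic objective and active only where the value estimates contradict the minimal assumption that the demonstrator outperforms the current policy. This is a minimal sketch under that assumption, not the authors' implementation; the name rectified_critic_loss, the margin parameter, and the toy actor and critic networks are hypothetical.

```python
# Illustrative sketch only (not the LIDAR loss from the paper): a hinge-style
# rectification term that penalizes the critic wherever its estimates
# contradict the assumption Q(s, a_demo) >= Q(s, pi(s)).
import torch
import torch.nn as nn


def rectified_critic_loss(critic, actor, demo_states, demo_actions, margin=0.0):
    """Zero wherever the critic already ranks the demonstration at or above
    the current policy; positive only on 'contradictory' demonstrations."""
    q_demo = critic(demo_states, demo_actions)      # Q(s, a_demo)
    with torch.no_grad():                           # the actor is not updated here
        pi_actions = actor(demo_states)
    q_pi = critic(demo_states, pi_actions)          # Q(s, pi(s))
    # Hinge on the (negated) estimated advantage of the demonstration.
    return torch.clamp(q_pi - q_demo + margin, min=0.0).mean()


if __name__ == "__main__":
    state_dim, action_dim = 8, 2

    class Critic(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64),
                                     nn.ReLU(), nn.Linear(64, 1))

        def forward(self, state, action):
            return self.net(torch.cat([state, action], dim=-1))

    actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, action_dim), nn.Tanh())
    critic = Critic()

    demo_states = torch.randn(32, state_dim)
    demo_actions = torch.randn(32, action_dim).clamp(-1.0, 1.0)
    loss = rectified_critic_loss(critic, actor, demo_states, demo_actions, margin=0.1)
    loss.backward()  # gradients flow only into the critic in this sketch
    print(float(loss))
```

Because the term vanishes once the critic ranks a demonstration correctly, demonstrations far from optimal would only contribute where they expose an estimation error, which is one way to read the selective learning described in the abstract.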

Acknowledgements

This work was supported by the National Key R&D Plan (2016YFB0100901), the National Natural Science Foundation of China (Grant Nos. U20B2062 & 61673237) and the Beijing Municipal Science & Technology Project (Z191100007419001).

Author information

Corresponding author

Correspondence to Huimin Ma.

Additional information

Xiaoqin Zhang is a PhD student in the Department of Electronic Engineering, Tsinghua University, China, where he received his BS degree in 2015. His current research interests are reinforcement learning and robotics, with a particular interest in learning from demonstrations.

Huimin Ma is a professor in the School of Computer and Communication Engineering, University of Science and Technology Beijing, China. She is the dean of the Department of Internet of Things and Electronic Engineering and the vice president of the Institute of Artificial Intelligence. She was the director of the 3D Image Lab in the Department of Electronic Engineering of Tsinghua University, China. She is also the secretary-general of the China Society of Image and Graphics. Her research interests are 3D image cognition and simulation. She introduces semantic priors from cognition and psychology into machine learning, and studies object detection, cognition, and navigation in complex scenes. In recent years, her research has been published in leading journals (TPAMI, TIP, etc.) and international conferences (CVPR, NIPS, etc.).

Xiong Luo received the PhD degree in computer applied technology from Central South University, China in 2004. He is currently a Professor with the School of Computer and Communication Engineering, University of Science and Technology Beijing, China. His current research interests include machine learning, computer vision, and computational intelligence. He has published extensively in his areas of interest in several journals, such as IEEE Transactions on Industrial Informatics, IEEE Transactions on Human-Machine Systems, and IEEE Transactions on Network Science and Engineering.

Jian Yuan received the PhD degree in electrical engineering from the University of Electronic Science and Technology of China, China in 1998. He is currently a Professor with the Department of Electronic Engineering, Tsinghua University, China. His current research focuses on the complex dynamics of networked systems.

About this article

Cite this article

Zhang, X., Ma, H., Luo, X. et al. LIDAR: learning from imperfect demonstrations with advantage rectification. Front. Comput. Sci. 16, 161312 (2022). https://doi.org/10.1007/s11704-021-0147-9

