Abstract
In actor-critic reinforcement learning (RL) algorithms, function estimation errors are known to cause ineffective random exploration early in training and to produce overestimated value estimates and suboptimal policies. In this paper, we address this problem by rectifying the advantage with imperfect demonstrations, thereby reducing function estimation errors. Pretraining with expert demonstrations is widely used to accelerate deep RL when interactions with the environment are expensive to obtain. However, existing methods such as behavior cloning often assume the demonstrations are optimal or carry additional performance labels, assumptions that rarely hold in the real world. In this paper, we explicitly handle imperfect demonstrations within actor-critic RL frameworks and propose a new method, learning from imperfect demonstrations with advantage rectification (LIDAR). LIDAR uses a rectified loss function to learn only from selected demonstrations; the rectification is derived from the minimal assumption that the demonstrating policies perform better than the current policy. LIDAR learns from the contradictions caused by estimation errors and in turn reduces those errors. We apply LIDAR to three popular actor-critic algorithms, DDPG, TD3, and SAC, and experiments show that our method observably reduces function estimation errors, effectively leverages demonstrations far from optimal, and consistently outperforms state-of-the-art baselines in all scenarios.
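The selective-learning idea in the abstract can be illustrated with a minimal sketch. This is not the paper's exact formulation; the function name, inputs, and the plain negative-log-likelihood imitation term are illustrative assumptions. The sketch keeps a demonstration pair only when its estimated advantage A = Q(s, a_demo) − V(s) is positive, which is the natural reading of the minimal assumption that the demonstrator outperforms the current policy:

```python
import numpy as np

def rectified_bc_loss(q_demo, v_current, logprob_demo):
    """Hypothetical advantage-rectified behavior-cloning loss (sketch).

    q_demo:       critic estimate Q(s, a_demo) per demonstration pair
    v_current:    value estimate V(s) of the current policy at s
    logprob_demo: log pi(a_demo | s) under the current policy

    Pairs with non-positive estimated advantage are masked out: under
    the assumption that the demonstrator is better than the current
    policy, such pairs reflect either critic estimation error or a
    demonstration no better than the policy's own behavior.
    """
    advantage = q_demo - v_current
    mask = (advantage > 0).astype(float)        # rectification step
    if mask.sum() == 0:
        return 0.0                              # nothing worth imitating
    # negative log-likelihood over the selected demonstrations only
    return float(-(mask * logprob_demo).sum() / mask.sum())
```

With two demonstration pairs where only the first has positive estimated advantage, the loss reduces to the imitation term of that single pair.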
Acknowledgements
This work was supported by the National Key R&D Plan (2016YFB0100901), the National Natural Science Foundation of China (Grant Nos. U20B2062 & 61673237) and the Beijing Municipal Science & Technology Project (Z191100007419001).
Author information
Xiaoqin Zhang is a PhD student in the Department of Electronic Engineering, Tsinghua University, China, where he received his BS degree in 2015. His current research interests are reinforcement learning and robotics, with a particular interest in learning from demonstrations.
Huimin Ma is a professor in the School of Computer and Communication Engineering, University of Science and Technology Beijing, China. She is the dean of the Department of Internet of Things and Electronic Engineering and the vice president of the Institute of Artificial Intelligence. She was previously the director of the 3D Image Lab in the Department of Electronic Engineering, Tsinghua University, China, and is the secretary-general of the China Society of Image and Graphics. Her research interest is 3D image cognition and simulation: she introduces semantic priors from cognition and psychology into machine learning, and studies object detection, cognition, and navigation in complex scenes. In recent years, her research has been published in high-level journals (TPAMI, TIP, etc.) and international conferences (CVPR, NIPS, etc.).
Xiong Luo received the PhD degree in computer applied technology from Central South University, China in 2004. He is currently a Professor with the School of Computer and Communication Engineering, University of Science and Technology Beijing, China. His current research interests include machine learning, computer vision, and computational intelligence. He has published extensively in his areas of interest in several journals, such as IEEE Transactions on Industrial Informatics, IEEE Transactions on Human-Machine Systems, and IEEE Transactions on Network Science and Engineering.
Jian Yuan received the PhD degree in electrical engineering from the University of Electronic Science and Technology of China, China in 1998. He is currently a Professor with the Department of Electronic Engineering, Tsinghua University, China. His current research focuses on the complex dynamics of networked systems.
Cite this article
Zhang, X., Ma, H., Luo, X. et al. LIDAR: learning from imperfect demonstrations with advantage rectification. Front. Comput. Sci. 16, 161312 (2022). https://doi.org/10.1007/s11704-021-0147-9