ABSTRACT
Offline imitation learning (OIL) is often used to solve complex continuous decision-making tasks such as robot control and autonomous driving, where it is either difficult to design an effective reward for learning or expensive and time-consuming for agents to collect data by interacting with the environment. However, the data used in previous OIL methods were all gathered by reinforcement learning algorithms guided by task-specific rewards; these methods therefore do not operate under a truly reward-free premise and still suffer from the difficulty of designing an effective reward function for real tasks. To this end, we propose the reward-free exploratory data driven offline imitation learning (ExDOIL) framework. ExDOIL first trains an unsupervised reinforcement learning agent by interacting with the environment and collects sufficient unsupervised exploration data during training; a task-independent yet simple and efficient reward function is then used to relabel the collected data; finally, an agent is trained to imitate the expert and complete the task through a conventional RL algorithm such as TD3. Extensive experiments on continuous control tasks demonstrate that the proposed framework achieves better imitation performance (28% higher episode returns on average) compared with the previous state-of-the-art method, ORIL, without any task-specific rewards.
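To make the three-stage pipeline concrete, here is a minimal, self-contained sketch in Python. It is illustrative only: the helper names (`collect_exploratory_data`, `relabel`), the toy dynamics, and the binary proximity-to-expert reward are assumptions made for exposition, not the actual ExDOIL implementation; stage 3 (offline TD3 training) is left as a stub.

```python
# Illustrative sketch of the three-stage ExDOIL pipeline (assumptions,
# not the authors' implementation): (1) collect reward-free exploratory
# data, (2) relabel it with a simple task-independent reward, (3) train
# an offline RL agent on the relabeled transitions.
import numpy as np

def collect_exploratory_data(env_step, explore_policy, n, state_dim=3):
    """Stage 1: roll out an unsupervised exploration policy (e.g., a
    curiosity- or entropy-driven agent) to gather reward-free transitions."""
    data, s = [], np.zeros(state_dim)
    for _ in range(n):
        a = explore_policy(s)
        s_next = env_step(s, a)
        data.append((s, a, s_next))
        s = s_next
    return data

def relabel(data, expert_states, eps=0.5):
    """Stage 2: task-independent relabeling. The binary rule used here
    (reward 1 if the next state lies within eps of any expert state,
    else 0) is one plausible simple choice, assumed for illustration."""
    labeled = []
    for s, a, s_next in data:
        dist = np.linalg.norm(expert_states - s_next, axis=1).min()
        labeled.append((s, a, float(dist < eps), s_next))
    return labeled

# Stage 3: train a conventional off-policy algorithm such as TD3 on the
# relabeled dataset, purely offline; omitted here for brevity.

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    env_step = lambda s, a: s + 0.1 * a            # toy linear dynamics
    explore_policy = lambda s: rng.normal(size=s.shape)
    expert_states = rng.normal(size=(50, 3))       # stand-in expert states
    data = collect_exploratory_data(env_step, explore_policy, 1000)
    labeled = relabel(data, expert_states)
    print("fraction of positive labels:",
          np.mean([r for _, _, r, _ in labeled]))
```

Under such a binary scheme, the relabeled dataset plays a role similar to the sparse rewards used in SQIL-style imitation: the offline agent is driven toward expert-visited states without ever observing a task-specific reward.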
REFERENCES
- Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning. 1.
- Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning. PMLR, 214–223.
- Michael Bain and Claude Sammut. 1995. A framework for behavioural cloning. In Machine Intelligence 15. 103–129.
- Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. 2018. Exploration by random network distillation. arXiv preprint arXiv:1810.12894 (2018).
- Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems 33 (2020), 9912–9924.
- Kamil Ciosek. 2021. Imitation learning by reinforcement learning. arXiv preprint arXiv:2108.04763 (2021).
- Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. 2018. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070 (2018).
- Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. 2020. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219 (2020).
- Scott Fujimoto and Shixiang Shane Gu. 2021. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems 34 (2021), 20132–20145.
- Scott Fujimoto, Herke van Hoof, and David Meger. 2018. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning. PMLR, 1587–1596.
- Scott Fujimoto, David Meger, and Doina Precup. 2019. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning. PMLR, 2052–2062.
- Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Thomas Paine, Sergio Gómez, Konrad Zolna, Rishabh Agarwal, Josh S. Merel, Daniel J. Mankowitz, Cosmin Paduraru, et al. 2020. RL Unplugged: A suite of benchmarks for offline reinforcement learning. Advances in Neural Information Processing Systems 33 (2020), 7248–7259.
- Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016).
- Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. 2017. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215 (2017).
- Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. 2019. Imitation learning via off-policy distribution matching. arXiv preprint arXiv:1912.05032 (2019).
- Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33 (2020), 1179–1191.
- Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. 2020. Reinforcement learning with augmented data. Advances in Neural Information Processing Systems 33 (2020), 19884–19895.
- Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, and Pieter Abbeel. 2021. URLB: Unsupervised reinforcement learning benchmark. arXiv preprint arXiv:2110.15191 (2021).
- Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, Eric Xing, Sergey Levine, and Ruslan Salakhutdinov. 2019. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274 (2019).
- Hao Liu and Pieter Abbeel. 2021. APS: Active pretraining with successor features. In International Conference on Machine Learning. PMLR, 6736–6747.
- Hao Liu and Pieter Abbeel. 2021. Behavior from the void: Unsupervised active pre-training. Advances in Neural Information Processing Systems 34 (2021), 18459–18473.
- Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. 2017. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning. PMLR, 2778–2787.
- Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. 2019. Self-supervised exploration via disagreement. In International Conference on Machine Learning. PMLR, 5062–5071.
- Siddharth Reddy, Anca D. Dragan, and Sergey Levine. 2019. SQIL: Imitation learning via reinforcement learning with sparse rewards. arXiv preprint arXiv:1905.11108 (2019).
- Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. 2018. DeepMind Control Suite. arXiv preprint arXiv:1801.00690 (2018).
- Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
- Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S. Merel, Jost Tobias Springenberg, Scott E. Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. 2020. Critic regularized regression. Advances in Neural Information Processing Systems 33 (2020), 7768–7778.
- Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, and Lerrel Pinto. 2022. Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning. arXiv preprint arXiv:2201.13425 (2022).
- Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. 2021. Reinforcement learning with prototypical representations. In International Conference on Machine Learning. PMLR, 11920–11931.
- Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. 2020. Offline learning from demonstrations and unlabeled experience. arXiv preprint arXiv:2011.13885 (2020).