Abstract
This paper is concerned with robust learning to simulate (RL2S), a new reinforcement learning (RL) problem that focuses on learning a high-fidelity environment model (i.e., a simulator) for serving diverse downstream tasks. Unlike environment learning in model-based RL, where the learned dynamics model is suitable only for providing simulated data to one specific policy, the goal of RL2S is to build a simulator that remains high-fidelity when interacting with various policies. The key challenge is therefore the robustness of the simulator, i.e., its ability to provide accurate simulations across diverse corner-case policies. By formulating the policy-environment interaction as a dual Markov decision process, we transform RL2S into a novel robust imitation learning problem and propose efficient algorithms to solve it. Experiments on continuous control scenarios demonstrate that the RL2S-enabled methods outperform the alternatives in learning high-fidelity simulators for evaluating, ranking, and training various policies.
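To make the dual-MDP view concrete, below is a minimal PyTorch sketch of one robust imitation step, based only on our reading of the abstract and not on the authors' released code. In the dual MDP the environment model plays the role of the agent: its "state" is the pair (s, a) produced by a sampling policy, and its "action" is the predicted next state s'. The GAIL-style discriminator, the network sizes, and the hard worst-case policy selection are all illustrative assumptions.

```python
# Hypothetical sketch of a robust-imitation update for a learned simulator.
# batches_per_policy[i] holds (s, a, s_next) transitions collected by
# policy i in the real environment.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 4, 2, 64

class Simulator(nn.Module):
    """Learned dynamics model: (s, a) -> predicted next state s'."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, STATE_DIM))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class Discriminator(nn.Module):
    """Scores (s, a, s') transitions: real environment vs. simulator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * STATE_DIM + ACTION_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, 1))
    def forward(self, s, a, s_next):
        return self.net(torch.cat([s, a, s_next], dim=-1))

def robust_step(sim, disc, batches_per_policy, sim_opt, disc_opt):
    """One adversarial update: pick the policy on which the simulator is
    currently least accurate, then update discriminator and simulator."""
    bce = nn.BCEWithLogitsLoss()
    # 1) Worst case over policies: largest generator loss = lowest fidelity.
    losses = []
    for s, a, s_next in batches_per_policy:
        with torch.no_grad():
            fake_score = disc(s, a, sim(s, a))
        losses.append(bce(fake_score, torch.ones_like(fake_score)))
    worst = max(range(len(losses)), key=lambda i: losses[i].item())
    s, a, s_next = batches_per_policy[worst]

    # 2) Discriminator: real transitions -> 1, simulated transitions -> 0.
    d_loss = bce(disc(s, a, s_next), torch.ones(len(s), 1)) + \
             bce(disc(s, a, sim(s, a).detach()), torch.zeros(len(s), 1))
    disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()

    # 3) Simulator: fool the discriminator on the worst-case policy.
    g_loss = bce(disc(s, a, sim(s, a)), torch.ones(len(s), 1))
    sim_opt.zero_grad(); g_loss.backward(); sim_opt.step()
    return worst, d_loss.item(), g_loss.item()

# Usage with random placeholder data for three sampling policies.
sim, disc = Simulator(), Discriminator()
sim_opt = torch.optim.Adam(sim.parameters(), lr=3e-4)
disc_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
data = [(torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM),
         torch.randn(32, STATE_DIM)) for _ in range(3)]
print(robust_step(sim, disc, data, sim_opt, disc_opt))
```

The hard max over policies is the simplest way to express robustness; a softened weighting over policies (e.g., CVaR-style averaging over the worst few) would be a natural variant under the same formulation.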
Notes
- 1. Since Ant has considerably larger state and action dimensions than the other environments, we sample more policies for training and testing.
Acknowledgements
The authors from Shanghai Jiao Tong University are supported by the "New Generation of AI 2030" Major Project (2018AAA0100900), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the National Natural Science Foundation of China (62076161, 81771937). This work is also sponsored by the Huawei Innovation Research Program.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, W. et al. (2021). Learning to Build High-Fidelity and Robust Environment Models. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol. 12975. Springer, Cham. https://doi.org/10.1007/978-3-030-86486-6_7
DOI: https://doi.org/10.1007/978-3-030-86486-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86485-9
Online ISBN: 978-3-030-86486-6
eBook Packages: Computer Science; Computer Science (R0)