Abstract
Deep reinforcement learning has been successfully applied to the generation of goal-directed behavior in artificial agents. However, existing algorithms are often not designed to reproduce human-like behavior, which may be desired in many environments, such as human–robot collaboration, social robotics and autonomous vehicles. Here we introduce a model-free and easy-to-implement deep reinforcement learning approach to mimic the stochastic behavior of a human expert by learning distributions of task variables from examples. As tractable use cases, we study static and dynamic obstacle avoidance tasks for an autonomous vehicle on a highway road in simulation (Unity). Our control algorithm receives a feedback signal from two sources: a deterministic (handcrafted) part encoding basic task goals and a stochastic (data-driven) part that incorporates human expert knowledge. Gaussian processes are used to model human state distributions and to assess the similarity between machine and human behavior. Using this generic approach, we demonstrate that the learning agent acquires human-like driving skills and can generalize to new roads and obstacle distributions unseen during training.
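For concreteness, the snippet below sketches how such a two-part feedback signal could be composed: a handcrafted task term plus a data-driven term that scores how likely the agent's state is under a Gaussian process fitted to human demonstrations. It is a minimal illustration, not the implementation used in this work; the synthetic demonstration data, kernel choice, task variables (longitudinal and lateral road position) and weighting are assumptions made for the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in for human expert demonstrations: lateral road position
# as a function of longitudinal position (illustrative task variables only).
x_human = np.linspace(0.0, 100.0, 50)[:, None]
y_human = 0.5 * np.sin(x_human[:, 0] / 10.0) + 0.05 * np.random.randn(50)

# Gaussian process modelling the human state distribution.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=10.0) + WhiteKernel(noise_level=0.01))
gp.fit(x_human, y_human)

def combined_reward(r_task, s_long, s_lat, w=0.5):
    """Deterministic (handcrafted) task term plus a stochastic (data-driven) term."""
    mean, std = gp.predict(np.array([[s_long]]), return_std=True)
    # Reward states that are likely under the human state distribution.
    r_human = float(np.exp(-0.5 * ((s_lat - mean[0]) / std[0]) ** 2))
    return r_task + w * r_human
```

In the full algorithm, a reward composed in this way is simply passed to a standard policy-gradient learner such as PPO, leaving the rest of the reinforcement learning loop unchanged.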





Notes
Video 1: Generalization I: Comparison of the performance of the vRL- and mRL-agents in an obstacle avoidance task on road track 1 with a random obstacle distribution.
Video 2: Generalization II: Testing the vRL-agent in an obstacle avoidance task on road track 1 with three different obstacle distributions (A: random, B: Gaussian, C: batch).
Video 3: Testing the vRL-agent in an overtaking task on road track 2. The same learning algorithm (PPO) and network architecture (except for the input layers) as for the obstacle avoidance task were used.
Funding
This research was supported in part by the Helmsley Charitable Trust through the Agricultural, Biological and Cognitive Robotics Initiative and by the Marcus Endowment Fund, both at Ben-Gurion University of the Negev. This research was also supported by the Israel Science Foundation (Grant No. 1627/17) and the Leona M. and Harry B. Helmsley Charitable Trust.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Hyperparameters
See Table 4.
Appendix B: Dynamic batch size update

The (Boolean) parameter \(\textit{completed\_rounds}\) indicates whether the agent has completed two rounds. The parameter \(\textit{max\_length}\) is the maximum number of consecutive steps the agent has taken from the start of an episode to its end. The parameter \(\textit{rem}\) is the number of steps remaining to reach twice \(\textit{max\_length}\); it guarantees that the new batch (B) can be divided into equal mini-batches (MB). B and MB were initialized to 512 and 256, respectively.
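A minimal Python sketch of one possible realization of this update is shown below; it assumes the new batch size is obtained by padding twice \(\textit{max\_length}\) with \(\textit{rem}\) steps up to the next multiple of the mini-batch size, which may differ in detail from the exact rule used in our experiments.

```python
def update_batch_size(completed_rounds, max_length,
                      batch_size=512, mini_batch_size=256):
    """Dynamic batch size update (illustrative sketch).

    completed_rounds -- Boolean; True once the agent has completed two rounds.
    max_length       -- longest run of consecutive steps from episode start to end.
    """
    if not completed_rounds:
        # Keep the current batch size until the agent completes two rounds.
        return batch_size, mini_batch_size
    target = 2 * max_length              # twice the longest episode so far
    rem = (-target) % mini_batch_size    # steps remaining to the next multiple of MB
    new_batch = target + rem             # divisible into equal mini-batches
    return new_batch, mini_batch_size
```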
Appendix C: Visual Turing test
Video A: human, video B: robot.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Emuna, R., Duffney, R., Borowsky, A. et al. Example-guided learning of stochastic human driving policies using deep reinforcement learning. Neural Comput & Applic 35, 16791–16804 (2023). https://doi.org/10.1007/s00521-022-07947-2