Example-guided learning of stochastic human driving policies using deep reinforcement learning

  • S.I.: Human-aligned Reinforcement Learning for Autonomous Agents and Robots
Abstract

Deep reinforcement learning has been successfully applied to the generation of goal-directed behavior in artificial agents. However, existing algorithms are often not designed to reproduce human-like behavior, which may be desired in many environments, such as human–robot collaborations, social robotics and autonomous vehicles. Here we introduce a model-free and easy-to-implement deep reinforcement learning approach to mimic the stochastic behavior of a human expert by learning distributions of task variables from examples. As tractable use-cases, we study static and dynamic obstacle avoidance tasks for an autonomous vehicle on a highway road in simulation (Unity). Our control algorithm receives a feedback signal from two sources: a deterministic (handcrafted) part encoding basic task goals and a stochastic (data-driven) part that incorporates human expert knowledge. Gaussian processes are used to model human state distributions and to assess the similarity between machine and human behavior. Using this generic approach, we demonstrate that the learning agent acquires human-like driving skills and can generalize to new roads and obstacle distributions unseen during training.
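
To make the reward structure above concrete, the following minimal Python sketch shows one way a handcrafted task term could be combined with a data-driven term obtained from a Gaussian process fitted to expert demonstrations. All names, the kernel choice, and the weighting are illustrative assumptions rather than the authors' exact formulation; the actual implementation is available in the repository linked in the Notes.

```python
# Minimal sketch (not the authors' exact method): a reward that adds a
# handcrafted task term to a data-driven term derived from a Gaussian
# process fitted to human demonstrations. All names are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Fit a GP to expert data, e.g. lateral lane position y as a function of
# longitudinal position x along the track (placeholder demonstrations).
expert_x = np.linspace(0.0, 100.0, 200).reshape(-1, 1)
expert_y = 0.5 * np.sin(0.1 * expert_x).ravel()
gp = GaussianProcessRegressor(kernel=RBF(10.0) + WhiteKernel(0.1))
gp.fit(expert_x, expert_y)

def reward(x, y, speed, target_speed=25.0, w_task=1.0, w_human=1.0):
    """Deterministic task term plus GP-based human-similarity term."""
    # Handcrafted part: penalize deviation from a target speed.
    r_task = -abs(speed - target_speed) / target_speed
    # Data-driven part: log-likelihood of the agent's lateral position
    # under the GP predictive distribution of human behavior at x.
    mu, sigma = gp.predict(np.array([[x]]), return_std=True)
    r_human = -0.5 * ((y - mu[0]) / sigma[0]) ** 2 - np.log(sigma[0])
    return w_task * r_task + w_human * r_human
```

Because the GP supplies a predictive mean and standard deviation, the data-driven term rewards states that lie within the spread of human behavior rather than penalizing every deviation from a single reference trajectory.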


Notes

  1. https://github.com/emunaran/stochastic-human-driving-policies-drl

  2. Video 1: Generalization I: Comparison of the performance of the vRL- and mRL-agent in an obstacle avoidance task on road track 1 with random obstacle distribution.

  3. Video 2: Generalization II: Testing the vRL-agent in an obstacle avoidance task on road track 1 with three different obstacle distributions (A: random, B: Gaussian, C: batch).

  4. Video 3: Testing the vRL-agent in an overtaking task on road track 2. The same learning algorithm (PPO) and network architecture (except for the input layers) as for the obstacle avoidance task were used.


Funding

This research was supported in part by the Helmsley Charitable Trust through the Agricultural, Biological and Cognitive Robotics Initiative and by the Marcus Endowment Fund, both at Ben-Gurion University of the Negev. This research was also supported by the Israel Science Foundation (Grant No. 1627/17) and the Leona M. and Harry B. Helmsley Charitable Trust.

Author information


Corresponding author

Correspondence to Armin Biess.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Hyperparameters

See Table 4.

Table 4 PPO and GAIL hyperparameters. Note that GAIL makes use of PPO

Appendix B: Dynamic batch size update

Figure a: Dynamic batch size update (pseudocode).

The (Boolean) parameter \(\textit{completed\_rounds}\) indicates whether the agent has completed two rounds. The parameter \(\textit{max\_length}\) is the maximum number of consecutive steps the agent has taken from the start of an episode to its end. The parameter \(\textit{rem}\) is the number of steps remaining to reach twice the \(\textit{max\_length}\); it guarantees that the new batch (B) can be divided into equal mini-batches (MB). B and MB were initialized to 512 and 256, respectively.
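
Since the pseudocode itself is not reproduced above, the following is a minimal Python sketch of how such an update could look, based only on the parameter descriptions; the exact rule (in particular the role of \(\textit{completed\_rounds}\)) is an assumption, not the authors' algorithm.

```python
# Illustrative reconstruction of the dynamic batch-size update described
# above (the original pseudocode in figure a is not reproduced here).
def update_batch_size(batch_size, mini_batch_size, max_length, completed_rounds):
    """Grow the PPO batch until it covers twice the longest episode seen so far.

    batch_size       -- current batch size B (initialized to 512)
    mini_batch_size  -- mini-batch size MB (initialized to 256)
    max_length       -- longest episode, in steps, observed so far
    completed_rounds -- True once the agent has completed two rounds of the track
    """
    if not completed_rounds and 2 * max_length > batch_size:
        rem = 2 * max_length - batch_size   # steps missing to reach 2 * max_length
        rem += (-rem) % mini_batch_size     # pad so B stays a multiple of MB
        batch_size += rem
    return batch_size, mini_batch_size
```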

Appendix C: Visual Turing test

Video A: human, video B: robot.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Emuna, R., Duffney, R., Borowsky, A. et al. Example-guided learning of stochastic human driving policies using deep reinforcement learning. Neural Comput & Applic 35, 16791–16804 (2023). https://doi.org/10.1007/s00521-022-07947-2
