Bestärkendes Lernen mittels Offline-Trajektorienplanung basierend auf iterativ approximierten Modellen

Max Pritzkoleit; Robert Heedt; Carsten Knoll; Klaus Röbenack

doi:10.1515/auto-2020-0024

Published by De Gruyter (O) July 31, 2020

Reinforcement learning via offline trajectory planning based on iteratively approximated models

From the journal at - Automatisierungstechnik

https://doi.org/10.1515/auto-2020-0024

Abstract

In this paper we use artificial neural networks (ANNs) to approximate the dynamics of nonlinear (mechanical) systems. These iteratively approximated neural system models are used in an offline trajectory planning step to determine an optimal feedback law, which is then applied to the real system. This model-based reinforcement learning (RL) approach is first evaluated in simulation on the swing-up manoeuvre of the cart-pole system, where it is significantly more data-efficient than model-free RL approaches. We also present experimental results from a test bench, on which the proposed algorithm approximates an optimal feedback law for the system sufficiently well within only a few trials.
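As a rough illustration of the loop the abstract describes (run a trial on the plant, refit a model from all data so far, plan offline on that model, apply the resulting feedback), the following Python sketch uses deliberately simpler stand-ins. A linear least-squares model replaces the neural network, a finite-horizon LQR replaces the trajectory planner, and a cart (double integrator) replaces the cart-pole. All function names, weights, and parameter values are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def real_step(x, u, dt=0.05):
    """Hypothetical 'real' plant: a cart (double integrator), unknown to the learner."""
    pos, vel = x
    return np.array([pos + dt * vel, vel + dt * u[0]])

def rollout(gains, x0, rng, noise=0.0):
    """One trial on the plant with time-varying feedback u_t = -K_t x_t (+ exploration)."""
    xs, us, x = [x0], [], x0
    for K in gains:
        u = -K @ x + noise * rng.standard_normal(1)
        x = real_step(x, u)
        us.append(u)
        xs.append(x)
    return np.array(xs), np.array(us)

def fit_model(X, U, Xn):
    """Least-squares fit of x' ~ A x + B u (assumed stand-in for a neural network)."""
    Z = np.hstack([X, U])
    theta, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
    n = X.shape[1]
    return theta[:n].T, theta[n:].T

def plan(A, B, Q, R, horizon):
    """Offline planning on the model: finite-horizon LQR via backward Riccati recursion."""
    P, gains = Q.copy(), []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # reorder so gains[0] belongs to the first time step

def learn(n_trials=3, horizon=200, seed=0):
    rng = np.random.default_rng(seed)
    Q, R = np.eye(2), 0.1 * np.eye(1)        # assumed cost weights
    x0 = np.array([1.0, 0.0])
    gains = [np.zeros((1, 2))] * horizon     # first trial: exploration input only
    X, U, Xn = [], [], []
    for _ in range(n_trials):
        xs, us = rollout(gains, x0, rng, noise=0.5)   # trial on the plant
        X.append(xs[:-1]); U.append(us); Xn.append(xs[1:])
        A, B = fit_model(np.vstack(X), np.vstack(U), np.vstack(Xn))  # refit model
        gains = plan(A, B, Q, R, horizon)             # replan on the model
    xs, _ = rollout(gains, x0, rng)          # evaluation without exploration noise
    return A, B, xs

if __name__ == "__main__":
    A, B, xs = learn()
    print("identified A:\n", A)
    print("final state norm:", np.linalg.norm(xs[-1]))
```

Because the toy plant is linear and noise-free, the model is identified after the first trial; the nonlinear cart-pole swing-up considered in the paper is a much harder problem, and this sketch only mirrors the structure of the iteration.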

Keywords: trajectory planning; reinforcement learning; neural networks; model approximation; tracking control

About the authors

Max Pritzkoleit

Dipl.-Ing. Max Pritzkoleit is a graduate and scholarship holder of the Institute of Control Theory (Institut für Regelungs- und Steuerungstheorie, TU Dresden). Research areas: reinforcement learning for open-loop and closed-loop control of dynamical systems.

Robert Heedt

Dipl.-Ing. Robert Heedt is a graduate and scholarship holder of the Institute of Control Theory (TU Dresden). Research areas: design and implementation of tracking controllers for mechanical systems, abstract knowledge representation.

Carsten Knoll

Dr.-Ing. Carsten Knoll is a research associate at the Institute of Control Theory (TU Dresden). Research areas: trajectory planning and tracking control for nonlinear systems, in particular using machine learning; semantic knowledge representation and ontologies.

Klaus Röbenack

Prof. Dr.-Ing. habil. Klaus Röbenack is Director of the Institute of Control Theory at the Faculty of Electrical and Computer Engineering of Technische Universität Dresden. Research areas: nonlinear controller and observer design, quantifier elimination, algorithmic differentiation.

Acknowledgements

The authors thank the audience at the workshop of GMA FA 1.30 for the stimulating discussions, and the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for providing computing time on the HRSK-II system.

Received: 2020-02-19

Accepted: 2020-06-16

Published online: 2020-07-31

Published in print: 2020-08-27
