
SWIRL: A Sequential Windowed Inverse Reinforcement Learning Algorithm for Robot Tasks With Delayed Rewards

Chapter in: Algorithmic Foundations of Robotics XII

Abstract

Inverse Reinforcement Learning (IRL) allows a robot to generalize from demonstrations to previously unseen scenarios by learning the demonstrator's reward function. However, in multi-step tasks the learned rewards may be delayed and hard to optimize directly. We present Sequential Windowed Inverse Reinforcement Learning (SWIRL), a three-phase algorithm that partitions a complex task into shorter-horizon subtasks based on linear dynamics transitions that occur consistently across demonstrations. SWIRL then learns a sequence of local reward functions that describe the motion between transitions. Once these reward functions are learned, SWIRL applies Q-learning to compute a policy that maximizes the rewards. We compare SWIRL (demonstrations to segments to rewards) with Supervised Policy Learning (SPL - demonstrations to policies) and Maximum Entropy IRL (MaxEnt-IRL - demonstrations to rewards) on standard Reinforcement Learning benchmarks: Parallel Parking with noisy dynamics, a Two-Link Acrobot, and a 2D GridWorld. We find that SWIRL converges to a policy with similar success rates (60%) in 3x fewer time-steps than MaxEnt-IRL, and requires 5x fewer demonstrations than SPL. In physical experiments using the da Vinci surgical robot, we evaluate the extent to which SWIRL generalizes from linear cutting demonstrations to cutting sequences of curved paths.
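To make the three phases concrete, below is a minimal, self-contained Python sketch on a toy GridWorld. Everything in it is an illustrative assumption rather than the paper's implementation: the names GridWorld, find_transitions, learn_local_rewards, and sequential_q_learning are hypothetical; Phase 1 detects changes in a locally fit linear dynamics model within a single demonstration (SWIRL additionally keeps only transitions that occur consistently across demonstrations); and Phase 2 substitutes simple quadratic rewards around segment endpoints for the learned local reward functions.

```python
# A minimal sketch of SWIRL's three phases on a toy grid.
# All helper names and the quadratic rewards are illustrative
# assumptions, not the paper's implementation.
import numpy as np

class GridWorld:
    """Tiny deterministic n-by-n grid MDP (hypothetical stand-in for
    the paper's benchmarks). States are cell indices; 4 move actions."""
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def __init__(self, n=8):
        self.n, self.n_states, self.n_actions = n, n * n, 4

    def reset(self):
        return 0  # always start in the top-left cell

    def step(self, s, a):
        r, c = divmod(s, self.n)
        dr, dc = self.MOVES[a]
        r = min(max(r + dr, 0), self.n - 1)
        c = min(max(c + dc, 0), self.n - 1)
        return r * self.n + c

    def position(self, s):
        return np.array(divmod(s, self.n), dtype=float)

def find_transitions(demo, window=3, threshold=0.5):
    """Phase 1 (sketch): flag times where a locally fit linear model
    x_{t+1} ~ x_t A changes abruptly within one demonstration."""
    transitions, prev_A = [], None
    for t in range(window, len(demo)):
        X, Y = demo[t - window:t], demo[t - window + 1:t + 1]
        A, *_ = np.linalg.lstsq(X, Y, rcond=None)  # local linear fit
        if prev_A is not None and np.linalg.norm(A - prev_A) > threshold:
            transitions.append(t)
        prev_A = A
    return transitions

def learn_local_rewards(demo, boundaries):
    """Phase 2 (sketch): one quadratic reward per segment, centered on
    the segment's terminal state."""
    goals = [demo[b] for b in boundaries] + [demo[-1]]
    rewards = [lambda s, g=g: -float(np.sum((s - g) ** 2)) for g in goals]
    return goals, rewards

def sequential_q_learning(env, rewards, goals, episodes=500, eps=0.2,
                          alpha=0.5, gamma=0.95, max_steps=200, seed=0):
    """Phase 3: tabular Q-learning on the state augmented with the
    current segment index; reaching a segment's goal advances the index."""
    K = len(rewards)
    Q = np.zeros((env.n_states, K, env.n_actions))
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        s, k = env.reset(), 0
        for _ in range(max_steps):
            a = (rng.integers(env.n_actions) if rng.random() < eps
                 else int(np.argmax(Q[s, k])))
            s2 = env.step(s, a)
            r = rewards[k](env.position(s2))
            reached = bool(np.array_equal(env.position(s2), goals[k]))
            terminal = reached and k == K - 1
            k2 = min(k + 1, K - 1) if reached else k
            target = r if terminal else r + gamma * np.max(Q[s2, k2])
            Q[s, k, a] += alpha * (target - Q[s, k, a])
            if terminal:
                break
            s, k = s2, k2
    return Q

# Example: a synthetic demonstration that moves down the left column,
# then across the bottom row (two distinct linear motions).
env = GridWorld(8)
path = list(range(0, 64, 8)) + list(range(57, 64))
demo = np.array([env.position(s) for s in path])
boundaries = find_transitions(demo)
goals, rewards = learn_local_rewards(demo, boundaries)
Q = sequential_q_learning(env, rewards, goals)
```

The design point worth noting is in Phase 3: the Q-function is indexed by the current segment as well as the state, so each short-horizon local reward can be optimized with standard Q-learning while the segment index carries the memory of task progress that a single delayed reward would otherwise obscure.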




Author information

Correspondence to Animesh Garg.


Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Krishnan, S. et al. (2020). SWIRL: A Sequential Windowed Inverse Reinforcement Learning Algorithm for Robot Tasks With Delayed Rewards. In: Goldberg, K., Abbeel, P., Bekris, K., Miller, L. (eds) Algorithmic Foundations of Robotics XII. Springer Proceedings in Advanced Robotics, vol 13. Springer, Cham. https://doi.org/10.1007/978-3-030-43089-4_43
