Expert Intervention Learning

An online framework for robot learning from explicit and implicit human feedback

Published in Autonomous Robots

Abstract

Scalable robot learning from human-robot interaction is critical if robots are to solve a multitude of tasks in the real world. Current approaches to imitation learning suffer from one of two drawbacks. On the one hand, they rely solely on off-policy human demonstration, which in some cases leads to a mismatch between the train and test distributions. On the other, they burden the human to label every state the learner visits, rendering the approach impractical in many applications. We argue that learning interactively from expert interventions enjoys the best of both worlds. Our key insight is that any amount of expert feedback, whether by intervention or non-intervention, provides information about the quality of the current state, the quality of the action, or both. We formalize this as a constraint on the learner’s value function, which we can efficiently learn using no-regret online learning techniques. We call our approach Expert Intervention Learning (EIL), and evaluate it on a real and simulated driving task with a human expert, where it learns collision avoidance from scratch with just a few hundred samples (about one minute) of expert control.

Notes

  1. While we assume \(Q_\theta (\cdot )\) is convex to prove regret guarantees, the update can be applied to non-convex function classes such as neural networks, as is done in similar work (Sun et al. 2017).

  2. The Fréchet distance is a metric commonly used to compare trajectories of potentially uneven length. Informally, given a person walking along one trajectory and a dog following the other, with neither allowed to backtrack, the Fréchet distance is the length of the shortest leash that lets both make it from start to finish. (A code sketch of the discrete Fréchet distance follows these notes.)

  3. We modify the action space to use a low constant acceleration and no braking, so that it reduces to a discrete set of possible steering angles \([-1,0,1]\), more closely matching the original DAgger experiment. We pre-process the 96×96 RGB observation into LAB color values, using the A and B channels to form a single-channel, binary-thresholded image containing all relevant features. We downscale that image to an 8×8 float image and reshape it into the final state vector \(s\in {\mathbb {R}}^{64}\). The expert network is a DQN with layer sizes 64, 8, 3 and tanh activation at the hidden layer. We use the 8 hidden-layer outputs as our feature vector. The learner function class \(\scriptstyle \varvec{q}(s,a)\) is the set of 27 weights and biases of the output layer. (A sketch of this preprocessing pipeline also follows these notes.)
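
For reference, the sketch below (ours, not from the paper) computes the discrete Fréchet distance between two sampled trajectories using the standard dynamic program; the function name and the 2-D point-list input format are illustrative assumptions.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two sampled trajectories.

    P, Q: arrays of shape (n, d) and (m, d) of waypoints.
    Uses the standard O(n*m) dynamic program: the coupling cost at (i, j)
    is the larger of the pointwise distance and the cheapest way to have
    reached (i, j) without either curve backtracking.
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    n, m = len(P), len(Q)
    ca = np.full((n, m), -1.0)  # memo table; distances are nonnegative

    def c(i, j):
        if ca[i, j] >= 0:
            return ca[i, j]
        d = np.linalg.norm(P[i] - Q[j])
        if i == 0 and j == 0:
            ca[i, j] = d
        elif i == 0:
            ca[i, j] = max(c(0, j - 1), d)
        elif j == 0:
            ca[i, j] = max(c(i - 1, 0), d)
        else:
            ca[i, j] = max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
        return ca[i, j]

    return c(n - 1, m - 1)

# Example: two short 2-D paths of different lengths.
# print(discrete_frechet([[0, 0], [1, 0], [2, 0]], [[0, 1], [2, 1]]))
```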
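
The observation pipeline in note 3 can likewise be sketched in a few lines; this is a minimal illustration assuming OpenCV conversions and uint8 RGB input, and the chroma thresholds are placeholders rather than the paper's actual settings.

```python
import cv2
import numpy as np

def preprocess(obs_rgb, a_thresh=150, b_thresh=150):
    """Map a 96x96x3 RGB observation to a flat 64-dim state vector.

    Pipeline per note 3: RGB -> LAB, binary-threshold the A and B chroma
    channels into a single mask, downscale to 8x8, flatten to R^64.
    The threshold values here are illustrative, not the paper's settings.
    """
    lab = cv2.cvtColor(obs_rgb, cv2.COLOR_RGB2LAB)
    a, b = lab[:, :, 1], lab[:, :, 2]
    mask = ((a > a_thresh) | (b > b_thresh)).astype(np.float32)       # 96x96 binary image
    small = cv2.resize(mask, (8, 8), interpolation=cv2.INTER_AREA)    # 8x8 float image
    return small.reshape(-1)                                          # state vector s in R^64
```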

References

  • Abbeel, P., & Ng, A.Y. (2004). Apprenticeship learning via inverse reinforcement learning. In: Proceedings of the twenty-first International Conference on Machine learning (ICML)

  • Alt, H., & Godau, M. (1995). Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry & Applications, 5, 75–91.

  • Amershi, S., Cakmak, M., Knox, W. B., & Kulesza, T. (2014). Power to the people: The role of humans in interactive machine learning. AI Magazine, 35, 105–120.

  • Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems.

  • Bajcsy, A., Losey, D.P., O’Malley, M.K., & Dragan, A.D. (2018). Learning from physical human corrections, one feature at a time. In: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction (HRI)

  • Bajcsy, A., Losey, D.P., O’Malley, M.K., & Dragan, A.D. (2017). Learning robot objectives from physical human interaction. In: Proceedings of the 1st Annual Conference on Robot Learning (CoRL). PMLR

  • Bi, J., Dhiman, V., Xiao, T., & Xu, C. (2020). Learning from interventions using hierarchical policies for safe learning. Proceedings of the AAAI Conference on Artificial Intelligence, 34(06), 10352–10360.

  • Bi, J., Xiao, T., Sun, Q., & Xu, C. (2018). Navigation by imitation in a pedestrian-rich environment. arXiv preprint arXiv:1811.00506

  • Celemin, C., & Ruiz-del Solar, J. (2019). An interactive framework for learning continuous actions policies based on corrective feedback. Journal of Intelligent & Robotic Systems, 95, 77–97.

  • Chen, M., Nikolaidis, S., Soh, H., Hsu, D., & Srinivasa, S. (2018). Planning with trust for human-robot collaboration. In: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction (HRI)

  • Chernova, S., & Veloso, M. (2009). Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34, 1–25.

  • Choudhury, S., Dugar, V., Maeta, S., MacAllister, B., Arora, S., Althoff, D., & Scherer, S. (2019). High performance and safe flight of full-scale helicopters from takeoff to landing with an ensemble of planners. Journal of Field Robotics (JFR), 36(8), 1275–1332.

  • Daumé III, H., Langford, J., & Marcu, D. (2009). Search-based structured prediction. Machine Learning Journal (MLJ), 75(3), 297–325.

  • Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., & Zhokhov, P. (2017). OpenAI Baselines. https://github.com/openai/baselines

  • Fisac, J.F., Gates, M.A., Hamrick, J.B., Liu, C., Hadfield-Menell, D., Palaniappan, M., Malik, D., Sastry, S.S., Griffiths, T.L., & Dragan, A.D. (2019). Pragmatic-pedagogic value alignment. In: Robotics Research, pp. 49–57.

  • Goecks, V. G., Gremillion, G. M., Lawhern, V. J., Valasek, J., & Waytowich, N. R. (2019). Efficiently combining human demonstrations and interventions for safe training of autonomous systems in real-time. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 2462–2470.

  • Grollman, D.H., & Jenkins, O.C. (2007). Dogged learning for robots. In: Proceedings 2007 IEEE International Conference on Robotics and Automation (ICRA).

  • Gupta, S., Davidson, J., Levine, S., Sukthankar, R., & Malik, J. (2017). Cognitive mapping and planning for visual navigation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Hadfield-Menell, D., Russell, S.J., Abbeel, P., & Dragan, A. (2016). Cooperative inverse reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS).

  • Jain, A., Wojcik, B., Joachims, T., & Saxena, A. (2013). Learning trajectory preferences for manipulators via iterative improvement. In: Advances in Neural Information Processing Systems (NeurIPS).

  • Judah, K., Fern, A.P., & Dietterich, T.G. (2012). Active imitation learning via reduction to iid active learning. In: 2012 AAAI Fall Symposium Series.

  • Kelly, M., Sidrane, C., Driggs-Campbell, K., & Kochenderfer, M.J. (2019). HG-DAgger: Interactive imitation learning with human experts. In: 2019 International Conference on Robotics and Automation (ICRA).

  • Kim, B., Farahmand, A., Pineau, J., & Precup, D. (2013). Learning from limited demonstrations. In: Advances in Neural Information Processing Systems (NeurIPS).

  • Kim, B., & Pineau, J. (2013). Maximum mean discrepancy imitation learning. In: Robotics: Science and Systems (RSS)

  • Kollmitz, M., Koller, T., Boedecker, J., & Burgard, W. (2020). Learning human-aware robot navigation from physical interaction via inverse reinforcement learning. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 11025–11031. IEEE

  • Laskey, M., Chuck, C., Lee, J., Mahler, J., Krishnan, S., Jamieson, K., Dragan, A., & Goldberg, K. (2017). Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations. In: IEEE International Conference on Robotics and Automation (ICRA).

  • Laskey, M., Lee, J., Hsieh, W., Liaw, R., Mahler, J., Fox, R., & Goldberg, K. (2017). Iterative noise injection for scalable imitation learning. arXiv preprint arXiv:1703.09327

  • Laskey, M., Staszak, S., Hsieh, W.Y.S., Mahler, J., Pokorny, F.T., Dragan, A.D., & Goldberg, K. (2016). SHIV: Reducing supervisor burden in dagger using support vectors for efficient learning from demonstrations in high dimensional state spaces. In: 2016 IEEE International Conference on Robotics and Automation (ICRA).

  • Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., & Quillen, D. (2018). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research (IJRR).

  • Loftin, R., Peng, B., MacGlashan, J., Littman, M. L., Taylor, M. E., Huang, J., & Roberts, D. L. (2016). Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning. Autonomous Agents and Multi-Agent Systems.

  • MacGlashan, J., Ho, M.K., Loftin, R., Peng, B., Wang, G., Roberts, D.L., Taylor, M.E., & Littman, M.L. (2017). Interactive learning from policy-dependent human feedback. In: Proceedings of the 34th International Conference on Machine Learning (ICML).

  • McPherson, D.L., Scobee, D.R., Menke, J., Yang, A.Y., & Sastry, S.S. (2018). Modeling supervisor safe sets for improving collaboration in human-robot teams. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 861–868. IEEE.

  • Menda, K., Driggs-Campbell, K.R., & Kochenderfer, M.J. (2018). EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning. arXiv preprint arXiv:1807.08364

  • Osa, T., Pajarinen, J., Neumann, G., Bagnell, J.A., Abbeel, P., & Peters, J. (2018). An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2), 1–179.

  • Packard, B., & Ontañón, S. (2017). Policies for active learning from demonstration. In: 2017 AAAI Spring Symposium Series

  • Pomerleau, D.A. (1989). ALVINN: An autonomous land vehicle in a neural network. In: Advances in Neural Information Processing Systems (NeurIPS).

  • Ross, S., Gordon, G., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS).

  • Ross, S., Melik-Barkhudarov, N., Shankar, K.S., Wendel, A., Dey, D., Bagnell, J.A., & Hebert, M. (2013). Learning monocular reactive UAV control in cluttered natural environments. In: IEEE International Conference on Robotics and Automation (ICRA).

  • Sadat, A., Ren, M., Pokrovsky, A., Lin, Y.C., Yumer, E., & Urtasun, R. (2019). Jointly learnable behavior and trajectory planning for self-driving vehicles. arXiv preprint arXiv:1910.04586

  • Sadigh, D., Sastry, S.S., Seshia, S.A., & Dragan, A. (2016). Information gathering actions over human internal state. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

  • Saunders, W., Sastry, G., Stuhlmueller, A., & Evans, O. (2017). Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173

  • Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.

  • Spencer, J., Choudhury, S., Barnes, M., Schmittle, M., Chiang, M., Ramadge, P., & Srinivasa, S. (2020). Learning from interventions: Human-robot interaction as both explicit and implicit feedback. In: Robotics: Science and Systems (RSS).

  • Spencer, J., Choudhury, S., Venkatraman, A., Ziebart, B., & Bagnell, J.A. (2021). Feedback in imitation learning: The three regimes of covariate shift. arXiv preprint arXiv:2102.02872

  • Srinivasa, S.S., Lancaster, P., Michalove, J., Schmittle, M., Summers, C., Rockett, M., Smith, J.R., Choudhury, S., Mavrogiannis, C., & Sadeghi, F. (2019). MuSHR: A Low-Cost, Open-Source Robotic Racecar for Education and Research. arXiv preprint arXiv:1908.08031

  • Sun, W., Venkatraman, A., Gordon, G.J., Boots, B., & Bagnell, J.A. (2017). Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction. In: Proceedings of the 34th International Conference on Machine Learning (ICML).

  • Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th International Conference on Machine Learning (ICML).

Acknowledgements

This work was (partially) funded by the DARPA Dispersed Computing program, NIH R01 (R01EB019335), NSF CPS (#1544797), NSF NRI (#1637748), the Office of Naval Research, RCTA, Amazon, and Honda Research Institute USA.

Author information

Corresponding author

Correspondence to Jonathan Spencer.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is one of several papers published in Autonomous Robots comprising the Special Issue on Robotics: Science and Systems 2020.

Appendices

Proofs

1.1 Reduction to no-regret, online learning

The general non-i.i.d. optimization we wish to solve is

$$\begin{aligned}&\min _{\pi } {\mathbb {E}}_{(s,a) \sim d^I_\pi (s,a)} \ell _C(s,a,\theta ) \nonumber \\&\quad + \lambda {\mathbb {E}}_{(s,a) \sim d_\pi (s,a)} \ell _B(s,a,\theta ). \end{aligned}$$
(18)

We prove the general setting directly rather than proving the result separately for \(\ell _C\) and \(\ell _B\).

We reduce this optimization problem to a sequence of convex losses \(\ell _i(\theta )\), where the i-th loss is a function of the distribution at that iteration: \(\ell _i(\theta )={\mathbb {E}}_{(s,a) \sim d^I_i} \ell _C(s,a,\theta ) + \lambda {\mathbb {E}}_{(s,a) \sim d_i} \ell _B(s,a,\theta )\). In our algorithm, the learner at iteration i applies Follow-the-Leader (FTL):

$$\begin{aligned} \theta _{i+1}&= \arg \min _\theta \sum _{t=1}^i \ell _t(\theta ) \\&= \arg \min _\theta \sum _{t=1}^i {\mathbb {E}}_{(s,a) \sim d^I_t} \ell _C(s,a,\theta ) + \lambda {\mathbb {E}}_{(s,a) \sim d_t} \ell _B(s,a,\theta ) \end{aligned}$$
(19)

Since FTL is a no-regret algorithm, the average regret satisfies

$$\begin{aligned} \frac{1}{N} \sum _{i=1}^N \ell _i(\theta _i) - \min _\theta \frac{1}{N} \sum _{i=1}^N \ell _i(\theta ) \le \gamma _N \end{aligned}$$
(20)

where \(\gamma _N\) goes to 0 as \(N\rightarrow \infty \), at rate \({\tilde{O}}(\tfrac{1}{N})\) for strongly convex \(\ell _i\) (see Theorem 2.4 and Corollary 2.2 in Shalev-Shwartz (2012)).
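
To make the reduction concrete, here is a minimal Python sketch (ours, not the authors' implementation) of the FTL update in (19): the expectations are replaced by sums over the aggregated sample buffers, and the per-sample surrogates \(\ell _C\) and \(\ell _B\) are passed in as callables.

```python
import numpy as np
from scipy.optimize import minimize

def ftl_update(theta0, intervention_buffer, onpolicy_buffer, loss_C, loss_B, lam=1.0):
    """One Follow-the-Leader step, as in Eq. (19).

    intervention_buffer, onpolicy_buffer: lists of (s, a) pairs aggregated over
    all iterations t = 1..i (empirical versions of d^I_t and d_t).
    loss_C, loss_B: per-sample surrogate losses, each a function of (s, a, theta).
    Returns theta_{i+1}, the minimizer of the sum of all past losses.
    """
    def total_loss(theta):
        lc = sum(loss_C(s, a, theta) for (s, a) in intervention_buffer)
        lb = sum(loss_B(s, a, theta) for (s, a) in onpolicy_buffer)
        return lc + lam * lb

    # L-BFGS with numerical gradients suffices for a small, smooth, convex Q-class.
    return minimize(total_loss, np.asarray(theta0, dtype=float), method="L-BFGS-B").x
```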

In this framework, we restate and prove Thm. 1.

Theorem 2

Let \(\ell _i(\theta ) = {\mathbb {E}}_{(s,a)\sim d_{\pi _{\theta _i}}} \ell (s,a,\theta )\). Also let \(\epsilon _N = \min _{\theta }\frac{1}{N} \sum _{i=1}^N \ell _i(\theta )\) be the loss of the best parameter in hindsight after N iterations. Let \(\gamma _N\) be the average regret of \(\theta _{1:N}\). There exists a \(\theta \in \theta _{1:N}\) s.t.

$$\begin{aligned} {\mathbb {E}}_{(s,a) \sim d_{\pi _{\theta }}} [\ell (s,a,\theta )] \le \epsilon _N + \gamma _N \end{aligned}$$
(21)

Proof

The performance of the best learner in the sequence \(\theta _1,\cdots ,\theta _N\) is at most the average loss of the learners, each evaluated on its own induced distribution (the minimum is at most the average):

$$\begin{aligned}&\min _{\theta \in \theta _{1:N}} {\mathbb {E}}_{(s,a)\sim d_{\pi _\theta }} [\ell (s,a,\theta )]\nonumber \\&\quad \le \frac{1}{N} \sum _{i=1}^N {\mathbb {E}}_{(s,a)\sim d_{\pi _{\theta _i}}} [\ell (s,a,\theta _i)] \end{aligned}$$
(22)

Using (20) we have

$$\begin{aligned} \begin{aligned}&\frac{1}{N} \sum _{i=1}^N {\mathbb {E}}_{(s,a)\sim d_{\pi _{\theta _i}}} [\ell (s,a,\theta _i)] \\&\quad \le \gamma _N + \min _\theta \frac{1}{N} \sum _{i=1}^N {\mathbb {E}}_{(s,a)\sim d_{\pi _{\theta _i}}} [\ell (s,a,\theta )] \\&\quad \le \gamma _N + \epsilon _N \end{aligned} \end{aligned}$$
(23)

\(\square \)

This proof can be extended to the finite-sample case following the original DAgger proofs. The theorem applies to each portion of the objective individually, yielding regret terms \(\gamma _N^B\) and \(\gamma _N^I\) that each go to zero as \(N\rightarrow \infty \); thus the combined objective, as well as each individual objective, is no-regret.

HG-DAgger counter example

We construct a counter-example for HG-DAgger approaches (Kelly et al. 2019; Goecks et al. 2019; Bi et al. 2018) in Fig. 9. Recall that in HG-DAgger, we only use the intervention loss \(\ell _C(\cdot )\).

The MDP is such that the learner can choose between two actions, Left (L) and Right (R), only at states \(s_0\) and \(s_1\). Unknown to the learner, but known to the expert, some of the edges are associated with costs. The expert deems a state “good enough” if it has a value of \(-9\). Hence, whenever the learner enters \(s_1\), the expert takes over to intervene and demonstrates \((s_1, L)\).

Fig. 9: Counter-example for HG-DAgger. Edges without costs are assumed to have \(c=0\), and a single edge leaving a node corresponds to taking any action.

HG-DAgger keeps only this intervention data and uses it as a classification loss. Suppose it uses a tabular policy. If it learns the policy \((s_0,L)\), \((s_1,L)\), it will indeed achieve \(\ell _C(s,a,\theta )=0\). However, the expert will continue to intervene, since this policy always exits the good-enough state.

Let us enumerate all policies with their implicit bound losses and intervention losses, assuming a penalty of 1 for every bad state or misclassified action:

  1. Policy \((s_0, L), (s_1,L)\): Loss \(\ell _B = 2\), \(\ell _C=0\)

  2. Policy \((s_0, L), (s_1,R)\): Loss \(\ell _B = 2\), \(\ell _C=1\)

  3. Policy \((s_0, R), (s_1,L)\): Loss \(\ell _B = 0\), \(\ell _C=0\)

  4. Policy \((s_0, R), (s_1,R)\): Loss \(\ell _B = 0\), \(\ell _C=0\)

The last two policies have the same intervention loss because the induced distribution is such that these policies never result in interventions (even though one learns an incorrect intervention action).

HG-DAgger looks only at the intervention loss \(\ell _C\) and hence may not end up learning \((s_0, R)\); EIL, on the other hand, will.
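
As a sanity check, the selection behavior of the two criteria can be reproduced directly from the loss table above. The few lines below (ours, for illustration only) encode the four policies and compare minimizing \(\ell _C\) alone against minimizing the combined objective \(\ell _C + \lambda \ell _B\).

```python
# Losses for the four tabular policies, transcribed from the list above.
policies = {
    ("L", "L"): {"l_B": 2, "l_C": 0},
    ("L", "R"): {"l_B": 2, "l_C": 1},
    ("R", "L"): {"l_B": 0, "l_C": 0},
    ("R", "R"): {"l_B": 0, "l_C": 0},
}

lam = 1.0  # trade-off weight from Eq. (18); any positive value works here

# HG-DAgger criterion: intervention loss only. Three policies tie at l_C = 0,
# including the bad one that chooses L at s_0 and keeps triggering interventions
# (min breaks ties arbitrarily, here by insertion order).
hg_best = min(policies, key=lambda p: policies[p]["l_C"])

# EIL criterion: combined objective l_C + lam * l_B. Only the (R, *) policies,
# which stay in the good-enough region, achieve the minimum.
eil_best = min(policies, key=lambda p: policies[p]["l_C"] + lam * policies[p]["l_B"])

print("argmin over l_C alone (ties broken arbitrarily):", hg_best)
print("argmin over l_C + lam*l_B:", eil_best)
```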

About this article

Cite this article

Spencer, J., Choudhury, S., Barnes, M. et al. Expert Intervention Learning. Auton Robot 46, 99–113 (2022). https://doi.org/10.1007/s10514-021-10006-9
