An Interactive Framework for Learning Continuous Actions Policies Based on Corrective Feedback

Journal of Intelligent & Robotic Systems

Abstract

The main goal of this article is to present COACH (COrrective Advice Communicated by Humans), a new learning framework that allows non-expert humans to advise an agent while it interacts with the environment in continuous-action problems. The human feedback is given in the action domain as binary corrective signals (increase/decrease the current action magnitude), and COACH adaptively adjusts the amount of correction that a given action receives, taking state-dependent past feedback into consideration. COACH also manages the credit assignment problem that normally arises when actions in continuous time receive delayed corrections. The proposed framework is characterized and validated extensively on four well-known learning problems. The experimental analysis includes comparisons with other interactive learning frameworks, with classical reinforcement learning approaches, and with human teleoperators solving the same problems themselves. In all the reported experiments, COACH outperforms the other methods in terms of learning speed and final performance. Notably, COACH has also been applied successfully to a complex real-world learning problem: ball dribbling by humanoid soccer players.
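To make the kind of update described above concrete, the following is a minimal, hypothetical Python sketch of a corrective-advice policy update in the spirit of the abstract: a linear policy over radial-basis features is nudged by binary human signals, and the step applied in a state grows when the new advice agrees with past advice received in similar states. The feature map, the step-size heuristic, and all constants are illustrative assumptions, not the update rules defined in the paper.

```python
import numpy as np

def rbf_features(state, centers, width=0.25):
    """Radial-basis feature vector for a scalar state (illustrative choice)."""
    return np.exp(-((state - centers) ** 2) / (2.0 * width ** 2))

class CorrectiveAdvicePolicy:
    """Sketch of a corrective-advice update: binary human feedback
    h in {-1, +1} nudges a linear policy, and the step used in a given
    state grows when the new advice agrees with past advice received in
    similar states (a stand-in for the adaptive correction magnitude
    described in the abstract)."""

    def __init__(self, centers, base_step=0.05):
        self.centers = np.asarray(centers, dtype=float)
        self.theta = np.zeros(len(self.centers))    # policy weights
        self.h_model = np.zeros(len(self.centers))  # state-dependent memory of past feedback
        self.base_step = base_step

    def action(self, state):
        """Continuous action proposed by the current policy."""
        return float(self.theta @ rbf_features(state, self.centers))

    def update(self, state, h):
        """Apply one piece of advice: h = +1 (increase) or h = -1 (decrease)."""
        phi = rbf_features(state, self.centers)
        h_pred = float(self.h_model @ phi)            # advice expected from past feedback
        step = self.base_step * (2.0 if h * h_pred > 0 else 1.0)
        self.theta += step * h * phi                  # supervised correction of the action
        self.h_model += 0.1 * (h - h_pred) * phi      # refine the feedback memory
```

In a teaching loop, the agent would call action(s) at every time step and update(s, h) whenever the teacher presses an increase/decrease key; the credit-assignment mechanism for delayed feedback mentioned in the abstract is not modeled in this sketch.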



Acknowledgements

This work was partially funded by FONDECYT project 1161500 and CONICYT-PCHA/Doctorado Nacional/2015-21151488.

Author information

Corresponding author

Correspondence to Carlos Celemin.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

(MP4 212 MB)

Appendix

Given that human feedback is a key component of the proposed learning framework, a new Hand-Gesture Recognition (HGR) interface that allows users to provide feedback to the agent is proposed. The interface detects five gestures: a positive correction, a negative correction, a neutral gesture used when no feedback is needed, a reward, and a punishment (see Fig. 15).

Fig. 15 Examples of recognized hand gestures

In order for the proposed system to be robust to variations in illumination, color, and non-uniform backgrounds, it uses: (i) Gaussian Mixture Model (GMM)-based Background Subtraction (BS) to detect regions of interest (ROIs), i.e. hand candidates; (ii) Kalman filtering for tracking the hand candidates; (iii) Local Binary Patterns (LBP) as features for characterizing the ROIs; and (iv) SVM classifiers for the final detection of the hand gestures. The block diagram is shown in Fig. 16. The main functionalities are described in the following paragraphs:

  • Detection of Regions of Interest (ROI): Movement blobs are first detected using background subtraction. Then, adjacent blobs are merged and filtered using morphological filters, and the largest blob is selected as a hand candidate and fed to the tracking system.

    In parallel, a second process applies BS to color edges: first, a binary edge image is computed, and then color information is incorporated into the edges. Afterwards, BS and area filtering are applied in the edge domain. Finally, the output of the area-filtering module is intersected with the color edges in the block “&”. In order to handle occlusions properly (see Fig. 16b), the block “&” deletes the blobs associated with the occluded edges, which BS labels as regions with movement (Fig. 17, left), since those edges are not present in the original image. The output is a blob containing the detected moving color edges (Fig. 17, right).

  • Tracking: The bounding-box parameters of the largest blob selected as a hand candidate by the previous module are used as observations by a Kalman filter, which estimates the final hand candidate by fusing the current ROI information with the previous ones. Afterwards, the image computed in the block “&” of the previous module is intersected with the Kalman-filtered bounding box. Examples of the resulting images are shown in Fig. 15 (a sketch of the detection and tracking stages is given after this list).

  • Feature Extraction and Classification: The image window provided by the tracking module is analyzed in order to classify the captured gesture. Histograms of LBP features are computed inside the window; since the window is a binary image, the LBP act as discretized measurements of the gradient, so their histograms are similar to Histograms of Oriented Gradients (HOG). This feature vector is fed to five SVM classifiers, one trained per gesture, which produce the final gesture detection.
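As referenced in the list above, the following is a minimal sketch of the detection and tracking stages, assuming OpenCV 4 and NumPy: GMM-based background subtraction, morphological filtering, largest-blob selection, and a constant-velocity Kalman filter over the bounding-box centre. The kernel size and noise covariances are illustrative assumptions, not the values used in the paper.

```python
import cv2
import numpy as np

# GMM-based background subtractor for detecting movement blobs.
bg_subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

kalman = cv2.KalmanFilter(4, 2)  # state: (x, y, vx, vy); measurement: (x, y)
kalman.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
kalman.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
kalman.processNoiseCov = 1e-3 * np.eye(4, dtype=np.float32)
kalman.measurementNoiseCov = 1e-1 * np.eye(2, dtype=np.float32)
kalman.errorCovPost = np.eye(4, dtype=np.float32)

def hand_candidate(frame):
    """Return the bounding box (x, y, w, h) of the largest moving blob, or None."""
    mask = bg_subtractor.apply(frame)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove small speckles
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # merge adjacent blobs
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return cv2.boundingRect(max(contours, key=cv2.contourArea))

def track(frame):
    """Fuse the current hand candidate with past ones through the Kalman filter."""
    prediction = kalman.predict()
    box = hand_candidate(frame)
    if box is None:
        return prediction[:2].ravel(), None      # no observation: keep the prediction
    x, y, w, h = box
    centre = np.array([[x + w / 2.0], [y + h / 2.0]], np.float32)
    estimate = kalman.correct(centre)
    return estimate[:2].ravel(), (w, h)
```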

Fig. 16 Hand gesture recognition system. a General scheme, b Detailed scheme

Fig. 17 Hands occluding edges in the edge domain (left); results of the intersection “&” module (right)

The dataset used for training the SVMs was built from images generated by the tracking module. Altogether, 1654 images of the five hand gestures were recorded, 60% of which were used for training and 40% for validation. The classification error is 9.05%, which is considered adequate for using the system as a feedback interface for the learning problems described in Section 4. A sketch of the feature-extraction and training procedure is given below.
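The sketch below illustrates, assuming scikit-image and scikit-learn, how LBP histograms could be computed on the binary tracking windows and how five one-vs-rest SVMs could be trained and evaluated on a 60/40 split. The LBP parameters and the SVM kernel are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

GESTURES = ["positive", "negative", "neutral", "reward", "punishment"]

def lbp_histogram(window, points=8, radius=1):
    """Normalized histogram of uniform LBP codes over a tracking window."""
    codes = local_binary_pattern(window.astype(float), points, radius, method="uniform")
    hist, _ = np.histogram(codes, bins=points + 2, range=(0, points + 2), density=True)
    return hist

def train_gesture_classifiers(windows, labels, seed=0):
    """Train one binary SVM per gesture on a 60/40 train/validation split.
    `windows` are 2-D arrays produced by the tracking module; `labels` hold
    the gesture index (0-4) of each window."""
    X = np.array([lbp_histogram(w) for w in windows])
    y = np.array(labels)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, train_size=0.6, stratify=y, random_state=seed)
    classifiers, val_errors = [], []
    for g in range(len(GESTURES)):
        clf = SVC(kernel="rbf").fit(X_tr, y_tr == g)          # one-vs-rest target
        val_errors.append(1.0 - clf.score(X_va, y_va == g))   # per-gesture validation error
        classifiers.append(clf)
    return classifiers, val_errors
```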

About this article

Cite this article

Celemin, C., Ruiz-del-Solar, J. An Interactive Framework for Learning Continuous Actions Policies Based on Corrective Feedback. J Intell Robot Syst 95, 77–97 (2019). https://doi.org/10.1007/s10846-018-0839-z
