
Learning via human feedback in continuous state and action spaces


Abstract

This paper considers the problem of extending Training an Agent Manually via Evaluative Reinforcement (TAMER) to continuous state and action spaces. The TAMER framework enables a non-technical human to train an agent through a natural form of feedback (positive or negative). The advantages of TAMER have been demonstrated on tasks in which agents are trained by human feedback alone or by human feedback combined with environment rewards. However, these methods were originally designed for problems with discrete states and actions, or with continuous states and discrete actions. This paper proposes ACTAMER, an extension of TAMER that allows both continuous states and actions. The new framework can use any general function approximator to model the human trainer's feedback signal. Moreover, the combined use of ACTAMER and reinforcement learning is also investigated and evaluated, with the combination of human feedback and reinforcement learning studied in two settings: sequential and simultaneous. Our experimental results demonstrate that the proposed method successfully allows a human to train an agent in two continuous state-action domains: Mountain Car and Cart-pole (balancing).
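
The following is a minimal, illustrative sketch of a TAMER-style learner for continuous states and actions; it is not the authors' implementation. A regression model H(s, a) of the human's scalar feedback is updated online, and actions are chosen by searching sampled candidate actions for the highest predicted feedback. The class and function names, the linear feature map, and the sampling-based action search are assumptions made for illustration only (ACTAMER itself admits any general function approximator).

    # Illustrative sketch only (not the paper's code): a TAMER-style learner
    # for continuous states and actions. A linear model H(s, a) is fit online
    # to the human's scalar feedback; actions are chosen by sampling candidates
    # and taking the one with the highest predicted feedback.
    import numpy as np

    class FeedbackModel:
        def __init__(self, n_features, lr=0.1):
            self.w = np.zeros(n_features)   # weights of the feedback model H(s, a)
            self.lr = lr                    # learning rate for online updates

        def features(self, state, action):
            # Hypothetical feature map; any function approximator could be used.
            s = np.asarray(state, dtype=float)
            return np.concatenate([s, [action], s * action, [1.0]])

        def predict(self, state, action):
            return float(self.w @ self.features(state, action))

        def update(self, state, action, human_feedback):
            # One gradient step on the squared error to the trainer's feedback.
            phi = self.features(state, action)
            self.w += self.lr * (human_feedback - self.w @ phi) * phi

    def select_action(model, state, low=-1.0, high=1.0, n_candidates=50):
        # Continuous action choice: sample candidates, take the argmax of H(s, a).
        candidates = np.random.uniform(low, high, n_candidates)
        return max(candidates, key=lambda a: model.predict(state, a))

    # Toy usage with a 2-D state; a real run would query an environment
    # (e.g. Mountain Car) and a human trainer for the feedback value.
    model = FeedbackModel(n_features=6)     # 2 + 1 + 2 + 1 features for a 2-D state
    state = [0.3, -0.1]
    action = select_action(model, state)
    model.update(state, action, human_feedback=1.0)   # trainer signalled "good"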


Notes

  1. With respect to the actor’s parameters; see the illustrative actor update sketched after these notes.

  2. See http://library.rl-community.org/.
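
Footnote 1 refers to gradients taken with respect to the actor’s parameters. The sketch below illustrates, under assumptions not stated on this page (a Gaussian policy with a linear mean, a linear critic, and an additive weighting of the human feedback into the actor update), how one actor-critic step could fold a human feedback signal into the usual TD-error-driven update. It shows the general technique only, not the paper's exact update rule.

    # Hedged sketch (assumed form, not the paper's exact rule): one actor-critic
    # step for a continuous action, in which the actor's parameters are updated
    # along the policy gradient and the update is shaped by human feedback.
    import numpy as np

    def actor_critic_step(theta, w, phi_s, phi_s_next, action, reward,
                          human_feedback=0.0, gamma=0.99, alpha=0.01,
                          beta=0.1, sigma=0.5, h_weight=1.0):
        # Critic: TD error of a linear value function V(s) = w . phi(s).
        td_error = reward + gamma * (w @ phi_s_next) - (w @ phi_s)
        w = w + beta * td_error * phi_s

        # Actor: gradient of log N(action | theta . phi(s), sigma^2)
        # with respect to the actor's parameters theta (cf. footnote 1).
        mean = theta @ phi_s
        grad_log_pi = ((action - mean) / sigma ** 2) * phi_s

        # Simultaneous combination (assumed): TD error plus weighted human feedback.
        theta = theta + alpha * (td_error + h_weight * human_feedback) * grad_log_pi
        return theta, w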


Acknowledgements

This work was supported by the Collaborative Center of Applied Research on Service Robotics (ZAFH Servicerobotik, http://www.zafh-servicerobotik.de) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (2010-0012609).

Author information


Corresponding author

Correspondence to Ngo Anh Vien.


About this article

Cite this article

Vien, N.A., Ertel, W. & Chung, T.C. Learning via human feedback in continuous state and action spaces. Appl Intell 39, 267–278 (2013). https://doi.org/10.1007/s10489-012-0412-6

