
Active reward learning with a novel acquisition function

Published in Autonomous Robots

Abstract

Reward functions are an essential component of many robot learning methods. Defining such functions, however, remains hard in many practical applications. For tasks such as grasping, there are no reliable success measures available. Defining reward functions by hand requires extensive task knowledge and often leads to undesired emergent behavior. We introduce a framework wherein the robot simultaneously learns an action policy and a model of the reward function by actively querying a human expert for ratings. We represent the reward model using a Gaussian process and evaluate several classical acquisition functions (AFs) from the Bayesian optimization literature in this context. Furthermore, we present a novel AF, expected policy divergence. We demonstrate results of our method for a robot grasping task and show that the learned reward function generalizes to a similar task. Additionally, we evaluate the proposed novel AF on a real robot pendulum swing-up task.
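
The following is a minimal sketch, not the authors' implementation, of the kind of loop the abstract describes: a Gaussian-process reward model fit to expert ratings of outcome features, with a classical acquisition function (upper confidence bound is used here as a stand-in) selecting which new outcome to query the expert about. All names (rbf_kernel, gp_posterior, ucb) and the toy data are illustrative assumptions.

```python
# Sketch of GP-based active reward learning: keep a GP over outcome
# features, and query the human expert only on the outcome that an
# acquisition function deems most informative.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between two sets of outcome features."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_rated, y_ratings, X_candidates, noise_var=0.1):
    """GP posterior mean and std of the reward at candidate outcomes."""
    K = rbf_kernel(X_rated, X_rated) + noise_var * np.eye(len(X_rated))
    K_s = rbf_kernel(X_rated, X_candidates)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_ratings
    cov = rbf_kernel(X_candidates, X_candidates) - K_s.T @ K_inv @ K_s
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

def ucb(mean, std, beta=2.0):
    """A classical acquisition function (upper confidence bound)."""
    return mean + beta * std

# Active-query step: ask for a rating only on the most informative outcome.
rng = np.random.default_rng(0)
X_rated = rng.normal(size=(3, 2))          # outcome features already rated
y_ratings = rng.normal(size=3)             # expert ratings for them
X_candidates = rng.normal(size=(20, 2))    # outcomes from new rollouts

mean, std = gp_posterior(X_rated, y_ratings, X_candidates)
query_idx = int(np.argmax(ucb(mean, std)))
print("ask the expert to rate outcome", query_idx)
```

In the paper's setting the acquisition function would instead be one of the evaluated classical AFs or the proposed expected policy divergence, and the GP mean would serve as the reward signal for the policy update between queries.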



Acknowledgments

The authors gratefully acknowledge the support of the European Union Projects #FP7-ICT-270327 (Complacs) and #FP7-ICT-2013-10 (3rd Hand).

Author information

Corresponding author

Correspondence to Christian Daniel.

Additional information

This is one of several papers published in Autonomous Robots comprising the “Special Issue on Robotics Science and Systems”.


About this article


Cite this article

Daniel, C., Kroemer, O., Viering, M. et al. Active reward learning with a novel acquisition function. Auton Robot 39, 389–405 (2015). https://doi.org/10.1007/s10514-015-9454-z
