
Reinforcement Learning Agents

Artificial Intelligence Review

Abstract

Reinforcement Learning (RL) is learning through direct experimentation. It does not assume the existence of a teacher that provides examples upon which learning of a task takes place; instead, in RL experience is the only teacher. With historical roots in the study of biological conditioned reflexes, RL attracts the interest of engineers and computer scientists because of its theoretical relevance and its potential applications in fields as diverse as Operational Research and Intelligent Robotics.

Computationally, RL is intended to operate in a learning environment composed of two subjects: the learner and a dynamic process. At successive time steps, the learner makes an observation of the process state, selects an action and applies it back to the process. Its goal is to find an action policy that controls the behavior of the dynamic process, guided by signals (reinforcements) that indicate how well or badly it has been performing the required task. These signals are usually associated with a dramatic condition, e.g., accomplishment of a subtask (reward) or complete failure (punishment), and the learner tries to optimize its behavior using a performance measure (a function of the received reinforcements). The crucial point is that, in order to do that, the learner must evaluate the conditions (associations between observed states and chosen actions) that led to rewards or punishments.
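
As a purely illustrative sketch of this interaction loop, the Python fragment below assumes a tabular Q-learning update rule (one concrete member of the family of algorithms the tutorial covers; the abstract itself does not commit to any particular one) and a toy five-state "corridor" process invented for the example.

import random

# Minimal sketch of the learner / dynamic-process loop: observe the state,
# select an action, apply it back to the process, receive a reinforcement,
# and update the evaluation of the (state, action) pair that produced it.
# The process is a toy corridor of 5 states; reaching the last state is rewarded.

N_STATES = 5                      # states 0..4, state 4 is the goal
ACTIONS = [-1, +1]                # move left or right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]   # action-value table Q[state][action]

def apply_action(state, action):
    """The dynamic process: returns the next observed state and a reinforcement."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        if random.random() < EPSILON:                      # occasional exploration
            a = random.randrange(len(ACTIONS))
        else:                                              # otherwise act greedily,
            best = max(Q[state])                           # breaking ties at random
            a = random.choice([i for i, q in enumerate(Q[state]) if q == best])
        nxt, r = apply_action(state, ACTIONS[a])
        # Q-learning update: credit the (state, action) pair that led to the reward.
        Q[state][a] += ALPHA * (r + GAMMA * max(Q[nxt]) - Q[state][a])
        state = nxt

print(Q)   # the greedy policy with respect to these values is the learned action policy

The performance measure here is the discounted sum of future reinforcements, which the table Q estimates per state-action pair; other measures (e.g., average reward) lead to the algorithmic variants discussed in the tutorial.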

Starting from basic concepts, this tutorial presents the many flavors of RL algorithms, develops the corresponding mathematical tools, assesses their practical limitations and discusses alternatives that have been proposed for applying RL to realistic tasks.





Cite this article

Ribeiro, C. Reinforcement Learning Agents. Artificial Intelligence Review 17, 223–250 (2002). https://doi.org/10.1023/A:1015008417172
