
Reinforcement Learning Agents

Artificial Intelligence Review

Abstract

Reinforcement Learning (RL) is learning through direct experimentation. It does not assume the existence of a teacher that provides examples upon which learning of a task takes place; instead, in RL experience is the only teacher. With historical roots in the study of biological conditioned reflexes, RL attracts the interest of engineers and computer scientists because of its theoretical relevance and its potential applications in fields as diverse as Operational Research and Intelligent Robotics.

Computationally, RL is intended to operate in a learning environment composed of two subjects: the learner and a dynamic process. At successive time steps, the learner makes an observation of the process state, selects an action and applies it back to the process. Its goal is to find an action policy that controls the behavior of the dynamic process, guided by signals (reinforcements) that indicate how well or badly it has been performing the required task. These signals are usually associated with a dramatic condition, e.g., accomplishment of a subtask (reward) or complete failure (punishment), and the learner tries to optimize its behavior using a performance measure (a function of the received reinforcements). The crucial point is that, in order to do that, the learner must evaluate the conditions (associations between observed states and chosen actions) that led to rewards or punishments.
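
As a purely illustrative sketch of this interaction loop, the Python fragment below assumes a tabular Q-learning update rule (one concrete member of the family of algorithms the tutorial covers; the abstract itself does not commit to any particular one) and a toy five-state "corridor" process invented for the example.

import random

# Minimal sketch of the learner / dynamic-process loop: observe the state,
# select an action, apply it back to the process, receive a reinforcement,
# and update the evaluation of the (state, action) pair that produced it.
# The process is a toy corridor of 5 states; reaching the last state is rewarded.

N_STATES = 5                      # states 0..4, state 4 is the goal
ACTIONS = [-1, +1]                # move left or right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]   # action-value table Q[state][action]

def apply_action(state, action):
    """The dynamic process: returns the next observed state and a reinforcement."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        if random.random() < EPSILON:                      # occasional exploration
            a = random.randrange(len(ACTIONS))
        else:                                              # otherwise act greedily,
            best = max(Q[state])                           # breaking ties at random
            a = random.choice([i for i, q in enumerate(Q[state]) if q == best])
        nxt, r = apply_action(state, ACTIONS[a])
        # Q-learning update: credit the (state, action) pair that led to the reward.
        Q[state][a] += ALPHA * (r + GAMMA * max(Q[nxt]) - Q[state][a])
        state = nxt

print(Q)   # the greedy policy with respect to these values is the learned action policy

The performance measure here is the discounted sum of future reinforcements, which the table Q estimates per state-action pair; other measures (e.g., average reward) lead to the algorithmic variants discussed in the tutorial.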

Starting from basic concepts, this tutorial presents the many flavors of RL algorithms, develops the corresponding mathematical tools, assesses their practical limitations and discusses alternatives that have been proposed for applying RL to realistic tasks.





Cite this article

Ribeiro, C. Reinforcement Learning Agents. Artificial Intelligence Review 17, 223–250 (2002). https://doi.org/10.1023/A:1015008417172
