Markov Decision Processes

Synonyms

Policy search

Definition

A Markov Decision Process (MDP) is a discrete, stochastic, and generally finite model of a system to which some external control can be applied. Originally developed in the Operations Research and Statistics communities, MDPs, and their extension to Partially Observable Markov Decision Processes (POMDPs), are now commonly used in the study of reinforcement learning in the Artificial Intelligence and Robotics communities (Bellman, 1957; Bertsekas & Tsitsiklis, 1996; Howard, 1960; Puterman, 1994). When used for reinforcement learning, the parameters of an MDP are first learned from data, and the resulting MDP is then solved to choose a behavior.

Formally, an MDP is defined as a tuple \(\langle \mathcal{S},\mathcal{A},T,R\rangle\), where \(\mathcal{S}\) is a discrete set of states, \(\mathcal{A}\) is a discrete set of actions, \(T : \mathcal{S}\times \mathcal{A}\rightarrow (\mathcal{S}\rightarrow \mathbb{R})\) is a stochastic transition function giving, for each state and action, a probability distribution over successor states, and \(R : \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}\) is a reward function.
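
The tuple can be made concrete with a short sketch. The following Python fragment (not part of the original entry; the function name, the discount factor gamma, and the two-state example MDP are illustrative assumptions) builds tabular T and R arrays and solves the MDP by value iteration, the classic dynamic-programming method (Bellman, 1957):

    import numpy as np

    def value_iteration(T, R, gamma=0.95, tol=1e-8):
        """Solve a finite MDP by value iteration.

        T[a, s, s2] -- probability of reaching state s2 from state s under action a
        R[s, a]     -- expected immediate reward for taking action a in state s
        gamma       -- discount factor in [0, 1)
        Returns the optimal value function V and a greedy policy.
        """
        n_states = T.shape[1]
        V = np.zeros(n_states)
        while True:
            # Q[s, a] = R[s, a] + gamma * sum_{s2} T[a, s, s2] * V[s2]
            Q = R + gamma * np.einsum('asn,n->sa', T, V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:   # stop once values stabilize
                return V_new, Q.argmax(axis=1)
            V = V_new

    # Hypothetical two-state, two-action MDP: state 1 pays reward +1;
    # action 1 tends to move the system into state 1, action 0 into state 0.
    T = np.array([[[0.9, 0.1],    # action 0, from states 0 and 1
                   [0.9, 0.1]],
                  [[0.1, 0.9],    # action 1, from states 0 and 1
                   [0.1, 0.9]]])
    R = np.array([[0.0, 0.0],     # R[s, a]
                  [1.0, 1.0]])
    V, policy = value_iteration(T, R)
    print("optimal values:", V, "greedy policy:", policy)

In the reinforcement-learning use of an MDP described above, the same T and R arrays would first be estimated from observed transitions and rewards before the model is solved.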

Recommended Reading

  • Albus, J. S. (1981). Brains, behavior, and robotics. Peterborough: BYTE. ISBN: 0070009759.

  • Andre, D., Friedman, N., & Parr, R. (1997). Generalized prioritized sweeping. In Advances in Neural Information Processing Systems (NIPS) (pp. 1001–1007).

  • Andre, D., & Russell, S. J. (2002). State abstraction for programmable reinforcement learning agents. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI).

  • Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In A. Prieditis & S. Russell (Eds.), Machine Learning: Proceedings of the Twelfth International Conference (ICML95) (pp. 30–37). San Mateo: Morgan Kaufmann.

  • Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University Press.

  • Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont: Athena Scientific.

  • Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227–303.

  • Gordon, G. J. (1995). Stable function approximation in dynamic programming (Technical report CMU-CS-95-103). School of Computer Science, Carnegie Mellon University.

  • Guestrin, C., Koller, D., Parr, R., & Venkataraman, S. (2003). Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19, 399–468.

  • Hansen, E. A., & Zilberstein, S. (1998). Heuristic search in cyclic AND/OR graphs. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI). http://rbr.cs.umass.edu/shlomo/papers/HZaaai98.html

  • Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge: MIT Press.

  • Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In European Conference on Machine Learning (ECML), Lecture Notes in Computer Science 4212 (pp. 282–293). Berlin: Springer.

  • Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 103–130.

  • Moore, A. W., Baird, L., & Kaelbling, L. P. (1999). Multi-value-functions: Efficient automatic action hierarchies for multiple goal MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI99).

  • Munos, R., & Moore, A. W. (2001). Variable resolution discretization in optimal control. Machine Learning, 1, 1–31.

  • Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. Wiley Series in Probability and Mathematical Statistics. New York: Wiley. ISBN: 0-471-61977-9.

  • St-Aubin, R., Hoey, J., & Boutilier, C. (2000). APRICODD: Approximate policy construction using decision diagrams. In Advances in Neural Information Processing Systems (NIPS-2000).

  • Sutton, R. S., Precup, D., & Singh, S. (1998). Intra-option learning about temporally abstract actions. In Machine Learning: Proceedings of the Fifteenth International Conference (ICML98) (pp. 556–564). Madison: Morgan Kaufmann.

  • Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.

Copyright information

© 2011 Springer Science+Business Media, LLC

Cite this entry

Uther, W. (2011). Markov Decision Processes. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_512
