Markov Decision Processes

Synonyms

Policy search

Definition

A Markov Decision Process (MDP) is a discrete, stochastic, and generally finite model of a system to which some external control can be applied. Originally developed in the Operations Research and Statistics communities, MDPs, and their extension to Partially Observable Markov Decision Processes (POMDPs), are now commonly used in the study of reinforcement learning in the Artificial Intelligence and Robotics communities (Bellman, 1957; Bertsekas & Tsitsiklis, 1996; Howard, 1960; Puterman, 1994). When used for reinforcement learning, the parameters of an MDP are first learned from data, and the resulting MDP is then solved to choose a behavior.

Formally, an MDP is defined as a tuple \(\langle \mathcal{S},\mathcal{A},T,R\rangle\), where \(\mathcal{S}\) is a discrete set of states, \(\mathcal{A}\) is a discrete set of actions, \(T : \mathcal{S}\times \mathcal{A}\rightarrow (\mathcal{S}\rightarrow \mathbb{R})\) is a stochastic transition function giving, for each state and action, a probability distribution over successor states, and \(R : \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}\) is a reward function.
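
The tuple can be made concrete with a short sketch. The following Python fragment (not part of the original entry; the function name, the discount factor gamma, and the two-state example MDP are illustrative assumptions) builds tabular T and R arrays and solves the MDP by value iteration, the classic dynamic-programming method (Bellman, 1957):

    import numpy as np

    def value_iteration(T, R, gamma=0.95, tol=1e-8):
        """Solve a finite MDP by value iteration.

        T[a, s, s2] -- probability of reaching state s2 from state s under action a
        R[s, a]     -- expected immediate reward for taking action a in state s
        gamma       -- discount factor in [0, 1)
        Returns the optimal value function V and a greedy policy.
        """
        n_states = T.shape[1]
        V = np.zeros(n_states)
        while True:
            # Q[s, a] = R[s, a] + gamma * sum_{s2} T[a, s, s2] * V[s2]
            Q = R + gamma * np.einsum('asn,n->sa', T, V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:   # stop once values stabilize
                return V_new, Q.argmax(axis=1)
            V = V_new

    # Hypothetical two-state, two-action MDP: state 1 pays reward +1;
    # action 1 tends to move the system into state 1, action 0 into state 0.
    T = np.array([[[0.9, 0.1],    # action 0, from states 0 and 1
                   [0.9, 0.1]],
                  [[0.1, 0.9],    # action 1, from states 0 and 1
                   [0.1, 0.9]]])
    R = np.array([[0.0, 0.0],     # R[s, a]
                  [1.0, 1.0]])
    V, policy = value_iteration(T, R)
    print("optimal values:", V, "greedy policy:", policy)

In the reinforcement-learning use of an MDP described above, the same T and R arrays would first be estimated from observed transitions and rewards before the model is solved.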

Recommended Reading

  • Albus, J. S. (1981). Brains, behavior, and robotics. Peterborough: BYTE. ISBN: 0070009759.

  • Andre, D., Friedman, N., & Parr, R. (1997). Generalized prioritized sweeping. In Advances in Neural Information Processing Systems (NIPS) (pp. 1001–1007).

  • Andre, D., & Russell, S. J. (2002). State abstraction for programmable reinforcement learning agents. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI).

  • Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In A. Prieditis & S. Russell (Eds.), Machine Learning: Proceedings of the Twelfth International Conference (ICML95) (pp. 30–37). San Mateo: Morgan Kaufmann.

  • Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University Press.

  • Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont: Athena Scientific.

  • Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227–303.

  • Gordon, G. J. (1995). Stable function approximation in dynamic programming (Technical report CMU-CS-95-103). School of Computer Science, Carnegie Mellon University.

  • Guestrin, C., Koller, D., Parr, R., & Venkataraman, S. (2003). Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19, 399–468.

  • Hansen, E. A., & Zilberstein, S. (1998). Heuristic search in cyclic AND/OR graphs. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI). http://rbr.cs.umass.edu/shlomo/papers/HZaaai98.html

  • Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge: MIT Press.

  • Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In European Conference on Machine Learning (ECML), Lecture Notes in Computer Science 4212 (pp. 282–293). Berlin: Springer.

  • Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 103–130.

  • Moore, A. W., Baird, L., & Kaelbling, L. P. (1999). Multi-value-functions: Efficient automatic action hierarchies for multiple goal MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI99).

  • Munos, R., & Moore, A. W. (2001). Variable resolution discretization in optimal control. Machine Learning, 1, 1–31.

  • Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. Wiley Series in Probability and Mathematical Statistics. New York: Wiley. ISBN: 0-471-61977-9.

  • St-Aubin, R., Hoey, J., & Boutilier, C. (2000). APRICODD: Approximate policy construction using decision diagrams. In Advances in Neural Information Processing Systems (NIPS-2000).

  • Sutton, R. S., Precup, D., & Singh, S. (1998). Intra-option learning about temporally abstract actions. In Machine Learning: Proceedings of the Fifteenth International Conference (ICML98) (pp. 556–564). Madison: Morgan Kaufmann.

  • Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.

Copyright information

© 2011 Springer Science+Business Media, LLC

Cite this entry

Uther, W. (2011). Markov Decision Processes. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_512
