Abstract
Formulating the problem facing an intelligent agent as a Markov decision process (MDP) is increasingly common in artificial intelligence, reinforcement learning, artificial life, and artificial neural networks. In this short paper we examine some of the reasons for the appeal of this framework. Foremost among these are its generality, simplicity, and emphasis on goal-directed interaction between the agent and its environment. MDPs may be becoming a common focal point for different approaches to understanding the mind. Finally, we speculate that this focus may be an enduring one insofar as many of the efforts to extend the MDP framework end up bringing a wider class of problems back within it.
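For concreteness, the sketch below illustrates the goal-directed agent-environment interaction the abstract refers to: at each step the agent picks an action, and the environment returns a next state and a reward that depend only on the current state and action. This is an illustrative sketch only, not code from the paper; the toy GridWorld environment and the random policy are assumptions made purely for illustration.

    # Minimal illustrative sketch (not from the paper) of the MDP agent-environment loop.
    # GridWorld and the random policy below are hypothetical stand-ins.
    import random

    class GridWorld:
        """Toy MDP: states 0..4 on a line; actions -1/+1; reward 1 at the right end."""
        def __init__(self):
            self.state = 2

        def step(self, action):
            # Next state and reward depend only on the current state and action
            # (the Markov property).
            self.state = max(0, min(4, self.state + action))
            reward = 1.0 if self.state == 4 else 0.0
            return self.state, reward

    env = GridWorld()
    state = env.state
    for t in range(10):
        action = random.choice([-1, +1])   # the agent's (here: random) policy
        state, reward = env.step(action)   # environment returns next state and reward
        print(t, state, reward)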
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
Cite this paper
Sutton, R.S. (1997). On the significance of Markov decision processes. In: Gerstner, W., Germond, A., Hasler, M., Nicoud, JD. (eds) Artificial Neural Networks — ICANN'97. ICANN 1997. Lecture Notes in Computer Science, vol 1327. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020167
DOI: https://doi.org/10.1007/BFb0020167
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63631-1
Online ISBN: 978-3-540-69620-9