Open Theoretical Questions in Reinforcement Learning

  • Conference paper
  • In: Computational Learning Theory (EuroCOLT 1999)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1572)

Abstract

Reinforcement learning (RL) concerns the problem of a learning agent interacting with its environment to achieve a goal. Instead of being given examples of desired behavior, the learning agent must discover by trial and error how to behave in order to get the most reward. The environment is a Markov decision process (MDP) with state set, \( \mathcal{S} \), and action set, \( \mathcal{A} \). The agent and the environment interact in a sequence of discrete steps, t = 0, 1, 2, ... The state and action at one time step, \( s_t \in \mathcal{S} \) and \( a_t \in \mathcal{A} \), determine the probability distribution for the state at the next time step, \( s_{t + 1} \in \mathcal{S} \), and, jointly, the distribution for the next reward, \( r_{t+1} \in \Re \). The agent’s objective is to choose each \( a_t \) to maximize the subsequent return:

$$ R_t = \sum\limits_{k = 0}^\infty {\gamma ^k r_{t + 1 + k} ,} $$

where the discount rate, 0 ≤ γ ≤ 1, determines the relative weighting of immediate and delayed rewards. In some environments, the interaction consists of a sequence of episodes, each starting in a given state and ending upon arrival in a terminal state, terminating the series above. In other cases the interaction is continual, without interruption, and the sum may have an infinite number of terms (in which case we usually assume γ < 1). Infinite horizon cases with γ = 1 are also possible though less common (e.g., see Mahadevan, 1996).
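
As a concrete illustration (not part of the paper; the environment interface, function names, and reward values below are assumptions made only for this sketch), the return above could be accumulated over one episode of agent-environment interaction as follows:

    def run_episode(env, policy, gamma=0.9):
        """Run one episode and accumulate R_0 = sum_k gamma^k * r_{1+k}.

        Hypothetical interfaces, assumed only for illustration:
        env.reset() -> initial state, env.step(a) -> (next_state, reward, terminal);
        policy maps a state to an action.
        """
        state = env.reset()
        ret, discount = 0.0, 1.0
        terminal = False
        while not terminal:
            action = policy(state)                      # choose a_t
            state, reward, terminal = env.step(action)  # observe s_{t+1}, r_{t+1}
            ret += discount * reward                    # add gamma^t * r_{t+1}
            discount *= gamma
        return ret

    # The same sum for a fixed (illustrative) reward sequence r_1, r_2, r_3 = 1, 0, 2 with gamma = 0.9:
    rewards = [1.0, 0.0, 2.0]
    print(sum(0.9 ** k * r for k, r in enumerate(rewards)))  # 1 + 0.9*0 + 0.81*2 = 2.62

For an episodic task the loop terminates on arrival in a terminal state; in the continuing case the sum has infinitely many terms and gamma < 1 keeps it finite.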

References

  • Baird, L.C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann, San Francisco.

  • Bertsekas, D.P., and Tsitsiklis, J.N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.

  • Crites, R.H., and Barto, A.G. (1996). Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1017–1023. MIT Press, Cambridge, MA.

  • Kearns, M., Mansour, Y., and Ng, A.Y. (in prep.). Sparse sampling methods for planning and learning in large and partially observable Markov decision processes.

  • Loch, J., and Singh, S. (1998). Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco.

  • Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22:159–196.

  • Moore, A.W., and Atkeson, C.G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103–130.

  • Singh, S.P. (1993). Learning to Solve Markovian Decision Processes. Ph.D. thesis, University of Massachusetts, Amherst. Appeared as CMPSCI Technical Report 93-77.

  • Singh, S.P., and Bertsekas, D. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference, pp. 974–980. MIT Press, Cambridge, MA.

  • Singh, S., and Dayan, P. (1998). Analytical mean squared error curves for temporal difference learning. Machine Learning.

  • Singh, S.P., and Sutton, R.S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158.

  • Sutton, R.S. (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. thesis, University of Massachusetts, Amherst.

  • Sutton, R.S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1038–1044. MIT Press, Cambridge, MA.

  • Sutton, R.S., and Barto, A.G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

  • Tesauro, G.J. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38:58–68.

  • Tsitsiklis, J.N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.

  • Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.

  • Watkins, C.J.C.H., and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sutton, R.S. (1999). Open Theoretical Questions in Reinforcement Learning. In: Fischer, P., Simon, H.U. (eds) Computational Learning Theory. EuroCOLT 1999. Lecture Notes in Computer Science, vol 1572. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49097-3_2

  • DOI: https://doi.org/10.1007/3-540-49097-3_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-65701-9

  • Online ISBN: 978-3-540-49097-5

  • eBook Packages: Springer Book Archive
