Abstract
We investigate the geometry of optimal memoryless time independent decision making in relation to the amount of information that the acting agent has about the state of the system. We show that the expected long term reward, discounted or per time step, is maximized by policies that randomize among at most k actions whenever at most k world states are consistent with the agent’s observation. Moreover, we show that the expected reward per time step can be studied in terms of the expected discounted reward. Our main tool is a geometric version of the policy improvement lemma, which identifies a polyhedral cone of policy changes in which the state value function increases for all states.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Ay, N., Montúfar, G., Rauh, J.: Selection criteria for neuromanifolds of stochastic dynamics. In: Yamaguchi, Y. (ed.) Advances in Cognitive Neurodynamics (III), pp. 147–154. Springer, Dordrecht (2013). doi:10.1007/978-94-007-4792-0_20
Hutter, M.: General discounting versus average reward. In: Balcázar, J.L., Long, P.M., Stephan, F. (eds.) ALT 2006. LNCS, vol. 4264, pp. 244–258. Springer, Heidelberg (2006). doi:10.1007/11894841_21
Kakade, S.: Optimizing average reward using discounted rewards. In: Helmbold, D., Williamson, B. (eds.) COLT 2001. LNCS, vol. 2111, pp. 605–615. Springer, Heidelberg (2001). doi:10.1007/3-540-44581-1_40
Montúfar, G., Ghazi-Zahedi, K., Ay, N.: Geometry and determinism of optimal stationary control in partially observable Markov decision processes. arXiv:1503.07206 (2015)
Ross, S.M.: Introduction to Stochastic Dynamic Programming. Academic Press Inc., Cambridge (1983)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems 12, pp. 1057–1063. MIT Press (2000)
Tsitsiklis, J.N., Van Roy, B.: On average versus discounted reward temporal-difference learning. Mach. Learn. 49(2), 179–191 (2002)
Acknowledgment
We thank Nihat Ay for support and insightful comments.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Montúfar, G., Rauh, J. (2017). Geometry of Policy Improvement. In: Nielsen, F., Barbaresco, F. (eds) Geometric Science of Information. GSI 2017. Lecture Notes in Computer Science(), vol 10589. Springer, Cham. https://doi.org/10.1007/978-3-319-68445-1_33
Download citation
DOI: https://doi.org/10.1007/978-3-319-68445-1_33
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68444-4
Online ISBN: 978-3-319-68445-1
eBook Packages: Computer ScienceComputer Science (R0)