Abstract
Stochastic gradient descent (SGD) has been at the center of many advances in modern machine learning. SGD processes examples sequentially, updating a weight vector in the direction that would most reduce the loss for that example. In many applications, some examples are more important than others and, to capture this, each example is given a non-negative weight that modulates its impact. Unfortunately, if the importance weights are highly variable, they can greatly exacerbate the difficulty of setting the step-size parameter of SGD. To ease this difficulty, Karampatziakis and Langford [6] developed a class of elegant algorithms that are much more robust in the face of highly variable importance weights in supervised learning. In this paper we extend their idea, which we call “sliding step”, to reinforcement learning, where importance weighting can be particularly variable due to the importance sampling involved in off-policy learning algorithms. We compare two alternative ways of doing the extension in the linear function approximation setting, then introduce specific sliding-step versions of the TD(0) and Emphatic TD(0) learning algorithms. We prove the convergence of our algorithms and demonstrate their effectiveness on both on-policy and off-policy problems. Overall, our new algorithms appear to be effective in bringing the robustness of the sliding-step technique from supervised learning to reinforcement learning.
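For intuition, here is a minimal Python sketch (not from the paper; the function names and toy data are ours) contrasting a naive importance-weighted SGD step with the closed-form "sliding step" update of Karampatziakis and Langford [6] for squared loss. With a large importance weight h, the naive step can overshoot the target badly, while the sliding-step update moves the prediction toward the target without ever passing it.

```python
import numpy as np

def naive_update(w, x, y, eta, h):
    # Ordinary SGD on the squared loss (y - w.x)^2 / 2, with the
    # importance weight h simply multiplied into the gradient.
    return w + eta * h * (y - w @ x) * x

def sliding_step_update(w, x, y, eta, h):
    # Closed-form sliding-step update of [6] for squared loss: the limit
    # of taking many infinitesimal SGD steps whose weights sum to h.
    # As eta*h grows, the prediction w.x approaches y but never overshoots.
    xx = x @ x
    if xx == 0.0:
        return w
    scale = (1.0 - np.exp(-eta * h * xx)) / xx
    return w + scale * (y - w @ x) * x

# Toy example: a single example with a huge importance weight.
w = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])
y, eta, h = 1.0, 0.1, 1000.0

print(naive_update(w, x, y, eta, h) @ x)         # 600.0: far past the target y = 1
print(sliding_step_update(w, x, y, eta, h) @ x)  # ~1.0: slides to y, never beyond
```

In off-policy TD learning the importance-sampling ratio plays the role of h and can be arbitrarily large, which is what motivates carrying this robustness over to TD(0) and Emphatic TD(0).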
Presented at the 2nd Scaling-Up Reinforcement Learning (SURL) Workshop, IJCAI 2019.
References
Beygelzimer, A., Dasgupta, S., Langford, J.: Importance weighted active learning. In: Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 49–56 (2009)
Bradtke, S.J., Barto, A.G.: Linear least-squares algorithms for temporal difference learning. Mach. Learn. 22, 33–57 (1996)
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
Ghiassian, S., Patterson, A., White, M., Sutton, R.S., White, A.: Online off-policy prediction. arXiv:1811.02597 (2018)
Huang, J., Smola, A.J., Gretton, A., Borgwardt, K.M., Schölkopf, B.: Correcting sample selection bias by unlabeled data. Adv. Neural Inf. Process. Syst. 19, 601–608 (2006)
Karampatziakis, N., Langford, J.: Online importance weight aware updates. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 392–399 (2011)
Precup, D., Sutton, R.S., Dasgupta, S.: Off-policy temporal-difference learning with function approximation. In: Proceedings of the 18th International Conference on Machine Learning (ICML), pp. 417–424 (2001)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Sutton, R.S.: Learning to predict by the methods of temporal differences. Mach. Learn. 3(1), 9–44 (1988)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
Sutton, R.S., et al.: Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 993–1000 (2009)
Sutton, R.S., Maei, H.R., Szepesvári, C.: A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In: Advances in Neural Information Processing Systems 21 (NIPS), pp. 1609–1616 (2008)
Sutton, R.S., Mahmood, A.R., White, M.: An emphatic approach to the problem of off-policy temporal-difference learning. J. Mach. Learn. Res. 17(73), 1–29 (2016)
Tsitsiklis, J.N.: Asynchronous stochastic approximation and Q-Learning. Mach. Learn. 16(3), 185–202 (1994)
Tsitsiklis, J.N., Van Roy, B.: An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5), 674–690 (1997)
Cite this paper
Tian, T., Sutton, R.S.: Extending Sliding-Step Importance Weighting from Supervised Learning to Reinforcement Learning. In: El Fallah Seghrouchni, A., Sarne, D. (eds.) Artificial Intelligence. IJCAI 2019 International Workshops. Lecture Notes in Computer Science, vol. 12158. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-56150-5_4