On the Worst-case Analysis of Temporal-difference Learning Algorithms

doi:10.1016/B978-1-55860-335-6.50040-4

Machine Learning Proceedings 1994

Proceedings of the Eleventh International Conference, Rutgers University, New Brunswick, NJ, July 10–13, 1994

1994, Pages 266-274

https://doi.org/10.1016/B978-1-55860-335-6.50040-4

Abstract

We study the worst-case behavior of a family of learning algorithms based on Sutton's [7] method of temporal differences. In our on-line learning framework, learning takes place in a sequence of trials, and the goal of the learning algorithm is to estimate a discounted sum of all the reinforcements that will be received in the future. In this setting, we are able to prove general upper bounds on the performance of a slightly modified version of Sutton's so-called TD(λ) algorithm. These bounds are stated in terms of the performance of the best linear predictor on the given training sequence, and are proved without making any statistical assumptions of any kind about the process producing the learner's observed training sequence. We also prove lower bounds on the performance of any algorithm for this learning problem, and give a similar analysis of the closely related problem of learning to predict in a model in which the learner must produce predictions for a whole batch of observations before receiving reinforcement.
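
To make the setting concrete, the following is a minimal sketch of the textbook linear TD(λ) update with accumulating eligibility traces, together with the discounted-sum target the learner is trying to estimate. It is not the slightly modified variant analyzed in the paper, whose exact form the abstract does not give. The function names, the learning rate eta, the zero initial weights, and the convention that the prediction after the final trial is zero are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Target values y_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

    Computed backwards over a finite reinforcement sequence.
    """
    out = [0.0] * len(rewards)
    acc = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        acc = rewards[t] + gamma * acc
        out[t] = acc
    return out


def td_lambda(
    xs: Sequence[Sequence[float]],
    rewards: Sequence[float],
    gamma: float,
    lam: float,
    eta: float,
) -> Tuple[List[float], List[float]]:
    """Textbook linear TD(lambda); returns (final weights, on-line predictions).

    Assumptions (not taken from the paper): weights start at zero, and the
    prediction following the last trial is treated as zero.
    """
    n = len(xs[0])
    w = [0.0] * n
    trace = [0.0] * n
    preds: List[float] = []
    for t, x in enumerate(xs):
        # Predict before the reinforcement for this trial is revealed.
        y_hat = dot(w, x)
        preds.append(y_hat)
        # Accumulating trace: decay old inputs by gamma * lambda, add the new one.
        trace = [gamma * lam * e + xi for e, xi in zip(trace, x)]
        # Bootstrap from the next input using the current weights.
        y_next = dot(w, xs[t + 1]) if t + 1 < len(xs) else 0.0
        # Temporal-difference error for this trial.
        delta = rewards[t] + gamma * y_next - y_hat
        w = [wi + eta * delta * e for wi, e in zip(w, trace)]
    return w, preds
```

On a given training sequence, the on-line predictions returned by td_lambda can be compared against discounted_returns(rewards, gamma), for example by summing squared differences, and the same can be done for any fixed weight vector. This is the kind of comparison with the best linear predictor that the abstract's bounds refer to, although the loss measure used here is an assumption.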

Cited by (5)

Exponentiated Gradient versus Gradient Descent for Linear Predictors
1997, Information and Computation
We consider two algorithms for on-line prediction based on a linear model. The algorithms are the well-known gradient descent (GD) algorithm and a new algorithm, which we call EG^±. They both maintain a weight vector using simple updates. For the GD algorithm, the update is based on subtracting the gradient of the squared error made on a prediction. The EG^± algorithm uses the components of the gradient in the exponents of factors that are used in updating the weight vector multiplicatively. We present worst-case loss bounds for EG^± and compare them to previously known bounds for the GD algorithm. The bounds suggest that the losses of the algorithms are in general incomparable, but EG^± has a much smaller loss if only few components of the input are relevant for the predictions. We have performed experiments which show that our worst-case upper bounds are quite tight already on simple artificial data.
Guided Policy Exploration for Markov Decision Processes Using an Uncertainty-Based Value-of-Information Criterion
2018, IEEE Transactions on Neural Networks and Learning Systems
Guided Policy Exploration for Markov Decision Processes using an Uncertainty-Based Value-of-Information Criterion
2018, arXiv
Sparse Q-learning with mirror descent
2012, Uncertainty in Artificial Intelligence - Proceedings of the 28th Conference, UAI 2012
Near-optimal reinforcement learning in polynomial time
2002, Machine Learning
