Abstract
Q(λ)-learning uses TD(λ) methods to accelerate Q-learning. For previous online, lookup-table-based implementations of Q(λ), the worst-case complexity of a single update step is bounded by the size of the state/action space. Our faster algorithm's worst-case complexity per update step is bounded by the number of actions. The algorithm is based on the observation that Q-value updates may be postponed until they are needed.
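The lazy-update idea stated in the abstract can be illustrated with a small sketch. The Python code below is not the paper's algorithm or pseudocode; it is a minimal, hypothetical tabular variant assuming a Peng-style Q(λ) with replacing traces and no trace cutting, and all names (LazyQLambda, acc, decay_pow, etc.) are illustrative. Instead of decaying every eligibility trace at each step (cost proportional to the size of the state/action space), it keeps one global accumulator of discounted TD errors and brings an individual Q-value up to date only when it is read or written, so a single step touches only the actions of the current and next state.

from collections import defaultdict

class LazyQLambda:
    def __init__(self, actions, alpha=0.1, gamma=0.95, lam=0.9):
        self.actions = list(actions)
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.q = defaultdict(float)   # committed Q-values, keyed by (state, action)
        self.trace = {}               # normalized replacing trace: e(s,a) / (gamma*lam)^t0
        self.last_acc = {}            # value of self.acc when (s,a) was last synced
        self.acc = 0.0                # global accumulator: sum_k (gamma*lam)^k * delta_k
        self.decay_pow = 1.0          # (gamma*lam)^t for the current step t

    def _sync(self, sa):
        # Commit the postponed, trace-weighted TD errors to Q(s,a).
        if sa in self.trace:
            self.q[sa] += self.alpha * self.trace[sa] * (self.acc - self.last_acc[sa])
            self.last_acc[sa] = self.acc

    def value(self, state, action):
        self._sync((state, action))
        return self.q[(state, action)]

    def greedy_action(self, state):
        return max(self.actions, key=lambda a: self.value(state, a))

    def update(self, state, action, reward, next_state):
        # One-step TD error from up-to-date Q-values; only the actions of
        # next_state and the current pair are synced, so this is O(|A|) work.
        best_next = self.value(next_state, self.greedy_action(next_state))
        delta = reward + self.gamma * best_next - self.value(state, action)

        # Start (or replace) the trace of the current pair *before* folding the
        # new TD error into the accumulator, so this pair is credited with delta.
        sa = (state, action)
        self._sync(sa)                      # commit credit from an earlier visit
        self.trace[sa] = 1.0 / self.decay_pow
        self.last_acc[sa] = self.acc

        self.acc += self.decay_pow * delta  # postponed credit for all traced pairs
        self.decay_pow *= self.gamma * self.lam

        # Omitted for brevity: periodic renormalization of acc / decay_pow / trace
        # to avoid numerical under- or overflow, and Watkins-style trace cutting
        # after exploratory (non-greedy) actions.

In this sketch an agent would call greedy_action(s) to act and update(s, a, r, s') once per transition; the pending trace-weighted corrections for all other state/action pairs remain implicit in acc until those pairs are next accessed.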
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wiering, M., Schmidhuber, J. (1998). Speeding up Q(λ)-learning. In: Nédellec, C., Rouveirol, C. (eds) Machine Learning: ECML-98. ECML 1998. Lecture Notes in Computer Science, vol 1398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026706
DOI: https://doi.org/10.1007/BFb0026706
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64417-0
Online ISBN: 978-3-540-69781-7